Stack Overflow when Pyparsing Ada 2005 Scoped Identifiers using Reference Manual Grammar - python

I'm currently implementing an Ada 2005 parser using Pyparsing and the reference manual grammar rules. We need this in order to analyze and transform parts of our aging Ada codebase to C/C++.
Most things work.
However, one little annoying problem remains:
The grammar rule name fails when parsing scoped identifiers (rule selected_component) such as the expression "Global_Types.Integer2", because it is part of a left-recursive grammar rule cycle.
I believe this rule is incorrectly written: the sub-rule direct_name should be moved later in the list of alternatives; in fact it should be placed last. Otherwise direct_name, and in turn name, matches "Global_Types" only and then expects the string to end after that. Not what I want.
Therefore I now move the rule direct_name to the end of the name alternatives... but then I instead get infinite recursion in Pyparsing, and Python reports "maximum recursion depth exceeded".
I believe the problem is caused by one of two things:
The associativity of the grammar rule selected_component is right-to-left. I've searched the Pyparsing reference manual but haven't found anything relevant. Should we treat the dot (.) as an operator with right-to-left associativity, or can we solve it through extensions and restructurings of the grammar rules?
Or the fact that there is no check in Pyparsing for infinite recursion. I believe this wouldn't be too hard to implement: keep a map from currently active rules (functions) to source position/offset (getTokensEndLoc()), and always fail a rule if the current source position/offset equals the position recorded when the rule was entered.
Recursive expressions with pyparsing may be related to my problem.
The problem also seems closely related to Need help in parsing part of python grammar, which unfortunately has no answers yet.
Here's the Ada 2005 grammar rule cycle that causes infinite recursion:
name =>
selected_component =>
prefix =>
name
Note that this problem is not an Ada-specific issue but is related to all grammars containing left-recursive rules.
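A common workaround is to eliminate the left recursion by rewriting the cycle as iteration: a name starts with a direct_name and is followed by zero or more dotted selectors. Here is a minimal pyparsing sketch of that idea (simplified identifiers only, not the full ARM rule):

from pyparsing import Word, alphas, alphanums, ZeroOrMore, Combine

# identifier ::= letter { letter | digit | '_' }
identifier = Word(alphas, alphanums + "_")

# name ::= identifier ('.' identifier)*   -- iteration instead of left recursion
name = Combine(identifier + ZeroOrMore("." + identifier))

print(name.parseString("Global_Types.Integer2"))  # -> ['Global_Types.Integer2']

Because the repetition consumes one dotted selector per step, the parser always advances, and the cycle name => selected_component => prefix => name never re-enters itself at the same input position.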

For reference, as noted in GNAT: The GNU Ada Compiler, §2.2 The Parser, "The Ada grammar given [in] the ARM is ambiguous, and a table-driven parser would be forced to modify the grammar to make it acceptable to LL (1) or LALR (1) techniques."


What is the idea/notion behind using '#' as comments in Python while C uses '#' for pre-processor directives?

My guess:
In Python:
// was used for floor division, and they couldn't come up with an alternative symbol for floor division, so // wasn't available for comments in Python.
# was an available character, since Python has no concept of pre-processing, and they chose to use # for comments.
I'm afraid your assumption is false: the floor division operator // is quite recent in Python (introduced by PEP 238 in Python 2.2), whereas # comments are part of the original design.
The origin of # comments is older:
the early Unix shells (the original sh command started in 1971, csh from 1978 and the Bourne shell from 1979) introduced the use of # to start line comments, and all later Unix shells followed.
Many script-like programming languages already used # for comments: sed (1974), make (1976), awk (1977) ...
# was also used for comments in configuration files
later scripting languages followed the same convention: TCL (1988), Perl (1988), Python (1991), PHP (1994), Ruby (1995) and more recently Cobra, Seed7, Windows PowerShell, R, Maple, Elixir, Julia, Nim...
Regarding the C preprocessor, here is a quote from Dennis M. Ritchie's own memories of The Development of the C Language about the origin of the C preprocessor and the true meaning of the # character:
Many other changes occurred around 1972-3, but the most important was the introduction of the preprocessor, partly at the urging of Alan Snyder [Snyder 74], but also in recognition of the utility of the file-inclusion mechanisms available in BCPL and PL/I. Its original version was exceedingly simple, and provided only included files and simple string replacements: #include and #define of parameterless macros. Soon thereafter, it was extended, mostly by Mike Lesk and then by John Reiser, to incorporate macros with arguments and conditional compilation. The preprocessor was originally considered an optional adjunct to the language itself. Indeed, for some years, it was not even invoked unless the source program contained a special signal at its beginning. This attitude persisted, and explains both the incomplete integration of the syntax of the preprocessor with the rest of the language and the imprecision of its description in early reference manuals.
# was used as a special character at the beginning of a C source file to determine if the preprocessor was to be invoked.
PL/I file include directives use %INCLUDE, and BCPL uses GET "libhdr".
C was by no means the only prior art available when Guido was choosing the details of Python language syntax. # is in fact a pretty common comment-introduction character, especially for scripting languages. Examples include the Bourne family of shells, the Csh family of shells, Perl, awk, and sed, all of which predate Python. I have always supposed that this aspect of Python's syntax was most influenced by this fairly large group of languages.
Whatever the influences were, they did not include consideration of a conflict with the use of // for floor division, as that operator was not introduced until much later.
The use of // comments dates back to 1967 [or earlier].
I came across Keith Thompson's answer: With arrays, why is it the case that a[5] == 5[a]?
In it are links to language reference manuals for B and BCPL at Bell Labs:
The B language manual from 1972 (precursor to C): User's Reference to B
The BCPL language manual from 1967 (precursor to B): Martin Richards's BCPL Reference Manual, 1967
From a link there, we get a PDF transcription of MIT Project MAC Memorandum M-352.
From that memorandum (the BCPL manual), in section 2.1.2 (b):
2.1.2 Hardware Conventions and Preprocessor Rules
(a) If the implementation character set contains both capital and small letters
then the following conventions hold:
(1) A name is either a single small letter or a sequence of letters and
digits starting with a capital letter. The character immediately
following a name may not be a letter or a digit.
(2) A sequence of two or more small letters which is not part of a NAME,
SECTBRA, SECTKET or STRINGCONST is a reserved system word and may be
used to represent a canonical symbol. For example:
let and logor could be used to represent LET and LOGOR but Let and
Logor are names.
(b) User’s comment may be included in a program between a double slash '//' and
the end of the line. Example:
let R[] be // This routine refills the vector Symb
§ for i = 1 to 200 do Readch [INPUT, lv Symb*[i]] §

How is PLY's parsetab.py formatted?

I'm working on a project to convert MATLAB code to Python, and have been somewhat successful after building off others' work. The tool uses PLY (an implementation of lex and yacc parsing tools for Python) to parse the MATLAB input. Unfortunately, it is a requirement that my code is written in Python 3, not Python 2. The tool runs without issue in Python 2, but I get a strange error in Python 3 (assuming A is an array):
log_idx = A <= 16;
^
SyntaxError: Unexpected "=" (parser)
The MATLAB code I am trying to convert is:
idx = A <= 16;
which should convert to almost the same thing in Python 3:
idx = A <= 16
The only real difference between the Python 3 code and the Python 2 code is the PLY-generated parsetab.py file, which has substantial differences in the following variables:
_tabversion
_lr_signature
_lr_action_items
_lr_goto_items
I'm having trouble understanding the purpose of these variables and why they could be different when the only difference was the Python version used to generate the parsetab.py file.
I tried searching for documentation on this, but was unsuccessful. I originally suspected it could be a difference in the way strings are formatted between Python 2 and Python 3, but that didn't turn anything up either. Is there anyone familiar with PLY that could give some insight into how these variables are generated, or why the Python version is creating this difference?
Edit: I'm not sure if this would be useful to anyone because the file is very long and cryptic, but below is an example of part of the first lines of _lr_action_items and _lr_goto_items
Python 2:
_lr_action_items = {'DOTDIV':([6,9,14,20,22,24,32,34,36,42,46,47,52,54,56,57,60,71,72,73,74,75 ...
_lr_goto_items = {'lambda_args':([45,80,238,],[99,161,263,]),'unwind':([1,8,28,77,87,160,168,177 ...
Python 3:
_lr_action_items = {'END_STMT':([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,26,27,39,41,48,50 ...
_lr_goto_items = {'top':([0,],[1,]),'stmt':([1,44,46,134,137,207,212,214,215,244,245,250 ...
I'm going to go out on a limb here, because you have provided practically no indication of what code you are actually using. So I'm just going to assume that you copied the lexer.py file from the github repository you linked to in your question.
There's an important clue in this error message:
log_idx = A <= 16;
^
SyntaxError: Unexpected "=" (parser)
Evidently, <= is not being scanned as a single token; otherwise, the parser would not see an = token at that point in the input. This can only mean that the scanner is returning two tokens, < and =, and if that's the case, it is most certainly a syntax error, as you would expect from
log_idx = A < = 16;
To figure out why the lexer would do this, it's important to understand how the Ply (default) lexer works. It gathers up all the lexer patterns from variables whose names start with t_, which must be either functions or variables whose values are strings. It then sorts them as follows:
function docstrings, in order by line number in the source file.
string values, in reverse order by length.
See Specification of Tokens in the Ply manual.
That usually does the right thing, but not always. The intention of sorting in reverse order by length is that a prefix pattern will come after a pattern which matches a longer string. So if you had patterns '<' and '<=', '<=' would be tried first, and so in the case where the input had <=, the < pattern would never be tried. That's important, since if '<' is tried first, '<=' will never be recognised.
However, this simple heuristic does not always work. The fact that a regular expression is shorter does not necessarily mean that its match will be shorter. So if you expect "maximal munch" semantics, you sometimes have to be careful about your patterns. (Or you can supply them as docstrings, because then you have complete control over the order.)
And whoever created that lexer.py file was not careful about their patterns, because it includes (among other issues):
t_LE = r"<="
t_LT = r"\<"
Note that since these are raw strings, the backslash is retained in the second string, so both patterns are of length 2:
>>> len(r"\<")
2
>>> len(r"<=")
2
Since the two patterns have the same length, their relative order in the sort is unspecified. And it is quite possible that the two versions of Python produce different sort orders, either because of differences in the implementation of sort, or because of differences in the order in which the dictionary of variables is iterated, or some combination of the above.
< has no special significance in a Python regular expression, so there is no need to backslash-escape it in the definition of t_LT. (Clearly, since it is not backslash-escaped in t_LE.) So the simplest solution would be to make the sort order unambiguous by removing the backslash:
t_LE = r"<="
t_LT = r"<"
Now, t_LE is longer and will definitely be tried first.
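To see the fix in action, here is a minimal, self-contained lexer (a hypothetical token set for illustration, not the converter's real lexer.py) in which the string patterns now sort unambiguously:

import ply.lex as lex

tokens = ('NAME', 'NUMBER', 'LE', 'LT', 'EQUALS', 'SEMI')

t_LE     = r'<='   # length 2: sorted ahead of t_LT, so '<=' is tried first
t_LT     = r'<'    # length 1: only matches when '<=' does not
t_EQUALS = r'='
t_SEMI   = r';'
t_NAME   = r'[A-Za-z_][A-Za-z0-9_]*'
t_NUMBER = r'\d+'
t_ignore = ' \t'

def t_error(t):
    raise SyntaxError("illegal character %r" % t.value[0])

lexer = lex.lex()
lexer.input("log_idx = A <= 16;")
print([tok.type for tok in lexer])
# -> ['NAME', 'EQUALS', 'NAME', 'LE', 'NUMBER', 'SEMI']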
That's not the only instance of this problem in the lexer file, so you might want to revise it carefully.
Note: You could also fix the problem by adding an unnecessary backslash to the t_LE pattern; there is an argument for taking the attitude, "When in doubt, escape." However, it is useful to know which characters need to be escaped in a Python regex, and the Python documentation for the re package contains a complete list. Also, consider using long raw strings for patterns which include quotes, since neither " nor ' need to be backslash escaped in a Python regex.
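Following the earlier note about docstrings: if the two patterns are instead given as functions, their order of definition (by source line number) fixes the matching order regardless of pattern length. A sketch:

def t_LE(t):
    r'<='
    return t

def t_LT(t):
    r'<'
    return t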

What metasyntax notation is Python using?

The full grammar specification for Python 3.6.3 is given here: https://docs.python.org/3/reference/grammar.html
It looks like EBNF extended with some constructs borrowed from regular expressions, for example ()* (repeat zero or more times) and ()+ (repeat one or more times).
What metasyntax is Python using, and where can its specification be found?
Update
Python's grammar is defined in this file (thanks @larsks). However, the question still stands: what notation is used?
The Python grammar is parsed by the parser in the Parser directory of the source. You can see this in Makefile.pre. This generates Include/graminit.[ch], which are used in, e.g., Python/ast.c as well as Modules/parsermodule.c.
The format of the grammar is described at the bottom of pgen.c:
Input is a grammar in extended BNF (using * for repetition, + for
at-least-once repetition, [] for optional parts, | for alternatives and
() for grouping).
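For example, this production from CPython's Grammar/Grammar file uses grouping with repetition and an optional part:

simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE

i.e. a simple statement is one or more small statements separated by semicolons, with an optional trailing semicolon, terminated by a newline.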

What are `lexpr` and `ApplicationExpression` in nltk?

What exactly does lexpr mean, and what do strings like r'\F x.x' mean? Also, what is an ApplicationExpression?
from nltk.sem.logic import *
lexpr = Expression.fromstring
zero = lexpr(r'\F x.x')
one = lexpr(r'\F x.F(x)')
two = lexpr(r'\F x.F(F(x))')
three = lexpr(r'\F x.F(F(F(x)))')
four = lexpr(r'\F x.F(F(F(F(x))))')
succ = lexpr(r'\N F x.F(N(F,x))')
plus = lexpr(r'\M N F x.M(F,N(F,x))')
mult = lexpr(r'\M N F.M(N(F))')
pred = lexpr(r'\N F x.(N(\G H.H(G(F)))(\u.x)(\u.u))')
v1 = ApplicationExpression(succ, zero).simplify()
See http://goo.gl/zog68k; nltk.sem.logic.Expression is:
"""This is the base abstract object for all logical expressions"""
There are many types of logical expressions implemented in nltk. See line 1124; the ApplicationExpression is:
This class is used to represent two related types of logical expressions.
The first is a Predicate Expression, such as "P(x,y)". A predicate expression is comprised of a FunctionVariableExpression or
ConstantExpression as the predicate and a list of Expressions as the arguments.
The second is an application of one expression to another, such as
"(\x.dog(x))(fido)".
The reason Predicate Expressions are treated as Application Expressions is
that the Variable Expression predicate of the expression may be replaced
with another Expression, such as a LambdaExpression, which would mean that
the Predicate should be thought of as being applied to the arguments.
The logical expression reader will always curry arguments in an application expression.
So, "\x y.see(x,y)(john,mary)" will be represented internally as
"((\x y.(see(x))(y))(john))(mary)". This simplifies the internals since
there will always be exactly one argument in an application.
The str() method will usually print the curried forms of application
expressions. The one exception is when the application expression is
really a predicate expression (i.e., the underlying function is an
AbstractVariableExpression). This means that the example from above
will be returned as "(\x y.see(x,y)(john))(mary)".
I'm not exactly an expert in formal logic, but your code above is declaring logical expressions over a function variable F and a bound variable x:
>>> from nltk.sem.logic import *
>>> lexpr = Expression.fromstring
>>> zero = lexpr(r'\F x.x')
>>> succ = lexpr(r'\N F x.F(N(F,x))')
>>> v1 = ApplicationExpression(succ, zero).simplify()
>>> v1
<LambdaExpression \F x.F(x)>
>>> print v1
\F x.F(x)
For a crash course, see http://theory.stanford.edu/~arbrad/slides/cs156/lec2-4.pdf; for an nltk crash course on lambda expressions, see http://www.cs.utsa.edu/~bylander/cs5233/nltk-intro.pdf
You are looking at a small part of quite a complicated toolkit. I'll try to give some background from a bit of researching on the web below, or you can just skip to the "direct answers" section if you like. I'll try to answer your question on the specific part you quote, but I am not an expert on either philosophical logic or natural language processing. The more I read about it, the less I seem to know, but I've included a load of hopefully useful references.
Description of tool / principles / introduction
The code you've posted is a sub-series of the regression tests for the logic module of the Natural Language toolkit for python (NLTK). This toolkit is described in a fairly accessible academic paper here, seemingly written by the authors of the tool. It describes the motivation for the toolkit and writing the logic module - in a nutshell to help automate interpretation of natural language.
The code you've posted defines a number of logical forms (LFs as they are referred to in the paper I linked). LFs cover statements in First order predicate logic, combined with the lambda operator (i.e. first order lambda calculus). I will not attempt to completely describe First order predicate logic here. There's a tutorial on lambda calculus here.
The code comes from a set of regression tests (i.e. demonstrations that the toolbox works correctly on simple, known example tests) on the howto page, showing how the toolbox can be used to do simple arithmetic operations. They are an exact encoding of this approach to arithmetic via lambda calculus (Wikipedia link) in the nltk toolkit.
The first four are the first four numbers in lambda calculus (Church encoding). The next four are arithmetic operators: succ (successor), plus (addition), mult (multiplication) and pred (predecessor). You have not got the tests that go along with these, so at the moment you simply have a number of LFs, followed by one example of lambda calculus application, combining two of these LFs (succ and zero) to get v1. As you have applied succ to zero, the result should be one, and that is what they test for on the howto page, i.e. v1 == one should evaluate to True.
Direct answer to python bits
Let's go through the elements of the code you've posted one by one.
lexpr is the function that generates Logical EXPRessions; it is an alias for Expression.fromstring, as lexpr = Expression.fromstring.
It takes a string argument. The r before the string tells Python to interpret it as a raw string literal. For the purposes of this question, that means we don't have to escape the \ symbol.
Within the strings, \ is the lambda operator.
F denotes a function and x a bound variable in lambda calculus.
The . or dot operator separates the bound variables from the body of the expression / abstraction.
So - to take the string you quote in the question:
r'\F x.x'
It is the Church encoding of zero. Church encoding is pretty abstract and hard to get your head round. This tutorial might help - I think I'm starting to get it... Unfortunately the example you've chosen is zero, and from what I can work out this is a definition rather than something you can derive. It can't be "evaluated to 0" in any meaningful sense. This is the simplest explanation I've found; I'm not in a position to comment on its rigour / correctness.
A Church numeral is a procedure that takes one argument, and that argument is itself another procedure that also takes one argument. The procedure zero represents the integer 0 by returning a procedure that applies its input procedure zero times
Finally, the ApplicationExpression takes one expression and applies it to the other, in this case applying succ (successor) to zero. This is, aptly, called an application in lambda calculus.
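Putting that together, a short check (a sketch along the lines of the howto's own test) confirms that applying succ to zero simplifies to the Church numeral one:

from nltk.sem.logic import Expression, ApplicationExpression

lexpr = Expression.fromstring
zero = lexpr(r'\F x.x')
one = lexpr(r'\F x.F(x)')
succ = lexpr(r'\N F x.F(N(F,x))')

# beta-reduce succ(zero); the result should equal one, i.e. \F x.F(x)
v1 = ApplicationExpression(succ, zero).simplify()
print(v1)         # \F x.F(x)
print(v1 == one)  # True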
EDIT:
Wrote all that and then found a book hidden on the nltk site - Chapter 10 is particularly applicable to this question, with this section describing lambda calculus.

Safe expression parser in Python

How can I allow users to execute mathematical expressions in a safe way?
Do I need to write a full parser?
Is there something like ast.literal_eval(), but for expressions?
The examples provided with Pyparsing include several expression parsers:
https://github.com/pyparsing/pyparsing/blob/master/examples/fourFn.py is a conventional arithmetic infix notation parser/evaluator implementation using pyparsing. (Despite its name, this actually does 5-function arithmetic, plus several trig functions.)
https://github.com/pyparsing/pyparsing/blob/master/examples/simpleBool.py is a boolean infix notation parser/evaluator, using a pyparsing helper method operatorPrecedence, which simplifies the definition of infix operator notations.
https://github.com/pyparsing/pyparsing/blob/master/examples/simpleArith.py and https://github.com/pyparsing/pyparsing/blob/master/examples/eval_arith.py recast fourFn.py using operatorPrecedence. The first just parses and returns a parse tree; the second adds evaluation logic.
If you want a more pre-packaged solution, look at plusminus, a pyparsing-based extensible arithmetic parsing package.
What sort of expressions do you want? Variable assignment? Function evaluation?
SymPy aims to become a full-fledged Python CAS.
A few weeks ago I did a similar thing, but for logical expressions (or, and, not, comparisons, parentheses etc.). I did this using the Ply parser. I created a simple lexer and parser. The parser generated an AST that was later used to perform calculations. Doing it this way allows you to fully control what the user enters, because only expressions that are compatible with the grammar will be parsed. A compressed sketch of that approach follows.
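The sketch below is illustrative, not the original code: for brevity it evaluates during parsing rather than building an explicit AST first, and the token set and grammar are minimal.

import ply.lex as lex
import ply.yacc as yacc

# --- lexer ---
tokens = ('TRUE', 'FALSE', 'AND', 'OR', 'NOT', 'LPAREN', 'RPAREN')

t_AND = r'and'
t_OR = r'or'
t_NOT = r'not'
t_LPAREN = r'\('
t_RPAREN = r'\)'
t_ignore = ' \t'

def t_TRUE(t):
    r'true'
    t.value = True
    return t

def t_FALSE(t):
    r'false'
    t.value = False
    return t

def t_error(t):
    raise SyntaxError("illegal character %r" % t.value[0])

# --- parser ---
precedence = (
    ('left', 'OR'),
    ('left', 'AND'),
    ('right', 'NOT'),
)

def p_expr_binop(p):
    '''expr : expr AND expr
            | expr OR expr'''
    p[0] = (p[1] and p[3]) if p[2] == 'and' else (p[1] or p[3])

def p_expr_not(p):
    'expr : NOT expr'
    p[0] = not p[2]

def p_expr_group(p):
    'expr : LPAREN expr RPAREN'
    p[0] = p[2]

def p_expr_literal(p):
    '''expr : TRUE
            | FALSE'''
    p[0] = p[1]

def p_error(p):
    raise SyntaxError("parse error at %r" % (p,))

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse("not (true and false) or false"))  # -> True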
Yes. Even if there were an equivalent of ast.literal_eval() for expressions, a Python expression can be lots of things other than just a pure mathematical expression, for example an arbitrary function call.
It wouldn't surprise me if there's already a good mathematical expression parser/evaluator available out there in some open-source module, but if not, it's pretty easy to write one of your own.
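A minimal sketch of writing your own on top of the standard library: parse the string with ast.parse in 'eval' mode, then walk the tree and permit only numeric literals and arithmetic operators. The whitelist below is illustrative, and it uses ast.Constant, so it assumes Python 3.8+.

import ast
import operator

# permitted operator nodes -> their implementations
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def safe_eval(expr):
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("disallowed construct: %s" % ast.dump(node))
    return _eval(ast.parse(expr, mode='eval'))

print(safe_eval("2 + 3 * (4 - 1)"))  # -> 11
safe_eval("__import__('os')")        # -> raises ValueError

Anything not on the whitelist (names, calls, attribute access) raises immediately, which is the property a plain eval cannot give you.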
Maths functions will consist of numeric and punctuation characters, possibly 'E' or 'e' if you allow scientific notation for rational numbers, and the only (other) legal use of alpha characters will be if you allow/provide specific maths functions (e.g. stddev). So it should be trivial to run along the string looking for alpha characters, check that the next little bit isn't suspicious, and then simply eval the string in a try/except block.
Re the comments this reply has received... I agree this approach is playing with fire. Still, that doesn't mean it can't be done safely. I'm new to python (< 2 months), so may not know the workarounds to which this is vulnerable (and of course a new Python version could always render the code unsafe in the future), but - for what little it's worth (mainly my own amusement) - here's my crack at it:
def evalMaths(s):
    # scan for runs of alphabetic characters and reject any identifier
    # that isn't on the whitelist before handing the string to eval()
    i = 0
    while i < len(s):
        idn = ''
        while i < len(s) and s[i].isalpha():
            idn += s[i]
            i += 1
        if idn and idn not in ('e', 'abs', 'round'):
            raise Exception("you naughty boy: don't " + repr(idn))
        else:
            i += 1
    return eval(s)
I would be very interested to hear if/how it can be circumvented... (^_^) BTW, I know you could call functions like abs2783 or _983 if they existed, but they won't. I mean something practical.
In fact, if anyone can do so, I'll create a question with 200 bounty and accept their answer.
