I'm working with Yacc (the PLY implementation) and I have no idea how to allow multiple occurrences of a parsing item without the program crashing due to infinite recursion.
Let's say I have:
def p_Attribute(p):
    ''' Attribute : STRING
                  | NUMBER
                  | Attribute
                  | empty '''
    [do stuff]
NOTE:
The question is similar to: Python PLY zero or more occurrences of a parsing item, but the solution proposed there does not work; I always get infinite recursion.
The problem here is actually in your grammar. Yacc-like parsers don't work with this kind of rule as it requires reducing an Attribute to reduce an Attribute, hence the infinite recursion. (Edit to add note: it would be OK if the right hand Attribute had some non-empty non-ambiguous token required, e.g., Attribute : STRING | NUMBER | '$' Attribute | empty is parseable and allows any number of $ signs in front of the other acceptable Attributes. But as it is, both alternatives, Attribute and empty, can be completely empty. That's a reduce/reduce conflict and it resolves badly here. I think the reduce/reduce conflict might resolve "as desired" if you put the empty rule first—but that's still the "wrong way" to do this, in general.)
Presumably you want one of three things (but I'm not sure which, so imagine the parser generator's confusion :-) ):
zero or more Attributes in sequence (equivalent to regexp-like "x*")
one or more Attributes in sequence (equivalent to regexp-like "x+")
zero or one Attribute, but no more (equivalent to regexp-like "x?")
For all three of these you should, in general, start by defining a single non-terminal that recognizes exactly one valid attribute-like-thing, e.g.:
Exactly_One_Attribute : STRING | NUMBER
(which I'm going to just spell Attribute below).
Then you define a rule that accepts what you intend to allow for your sequence (or optional-attribute). For instance:
Zero_Or_More_Attributes : Zero_Or_More_Attributes Attribute | empty
(This uses "left recursion", which should be the most efficient. Use right recursion—see below—only if you really want to recognize items in the other order.)
To require at least one attribute:
One_Or_More_Attributes : One_Or_More_Attributes Attribute | Attribute
(also left-recursive in this example), or:
Attribute_opt : empty | Attribute
which allows either "nothing" (empty) or exactly one attribute.
The right-recursive version is simply:
Zero_Or_More_Attributes : Attribute Zero_Or_More_Attributes | empty
As a general rule, when using right recursion, the parser winds up having to "shift" (push onto its parse stack) more tokens. Eventually the parser comes across a token that fails to fit the rule (in this case, something not a STRING or NUMBER) and then it can begin reducing each shifted token using the right-recursive rule, working on the STRING-and-NUMBERs right to left. Using left recursion, it gets to do reductions earlier, working left to right. See
http://www.gnu.org/software/bison/manual/html_node/Algorithm.html#Algorithm for more.
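In PLY terms, the left-recursive zero-or-more rule might look like the sketch below (the rule and function names are my own, and list-building actions stand in for "[do stuff]"). PLY itself isn't needed to see how the reductions accumulate a list, so the snippet drives the action functions by hand, with plain Python lists standing in for PLY's p object:

```python
# Sketch of the left-recursive "zero or more" grammar as PLY actions.
# Names (attributes, attribute, p_attributes_*) are illustrative.

def p_attributes_many(p):
    '''attributes : attributes attribute'''
    p[0] = p[1] + [p[2]]        # append the new attribute to the list so far

def p_attributes_empty(p):
    '''attributes : empty'''
    p[0] = []                   # the base case: an empty list

# Simulate the reductions PLY would perform for the input STRING NUMBER,
# using plain lists in place of PLY's YaccProduction object:
p = [None, None]                # attributes : empty
p_attributes_empty(p)
p = [None, p[0], "a"]           # attributes : attributes attribute  (a STRING)
p_attributes_many(p)
p = [None, p[0], 7]             # attributes : attributes attribute  (a NUMBER)
p_attributes_many(p)
assert p[0] == ["a", 7]
```

Because the recursion is on the left, each reduction happens as soon as one more attribute has been shifted, so the stack stays shallow no matter how long the sequence is.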
In python you can do this:
def myFunction():
    return "String", 5.5, True, 11

val1, val2, val3, val4 = myFunction()
I argue that this is a function which returns four values; however, my Python instructor says I am wrong and that it only returns one tuple.
Personally I think this is a distinction without a difference because there is no indication that these four values are converted into a tuple and then deconstructed as 4 values. I am unaware of how this is any different than the same type of construct in languages like JavaScript.
Am I right?
I'll take a stab at this, since I think the issue here is a lack of defining terms properly. To be completely pedantic about it, the correct answer to the question as worded is "zero". Python doesn't return a value; it returns an object, and an object is not the same thing as a value. Going back to basics on this one:
https://docs.python.org/3/reference/datamodel.html#objects-values-and-types
Objects are Python’s abstraction for data. All data in a Python
program is represented by objects or by relations between objects.
also:
Every object has an identity, a type and a value.
A value, as defined above, is something different from an object. Functions return objects, not values, so the answer to the question (as asked, if taken to literal extremes) is zero. If the question had been, "How many objects are returned to the caller by this function?" then the answer would be one. This is why defining terms is important, and why vague questions generate multiple (potentially correct) answers. In another sense, the correct answer would be five, because there are five things one might think of as "values" coming back from this function: the tuple, and the four items inside the tuple. In still another sense, the answer is four (as you've said), because the code flat out says return and then has four values afterwards.
So really, you're both right, and you're both wrong, but only because the question isn't sufficiently clear as to what it wants to know. The instructor is likely trying to put forth the idea that Python returns single objects, which may contain multiple other objects. This is important to know, because it contributes to Python's flexibility when passing data around. I'm not so sure the way the instructor worded it is achieving that goal, but I'm also not present in the class, so it's tough to say. Ideally, instruction should cover neurodiverse ways of understanding, but I'll save that soapbox for a different discussion.
Let's distill it like this in hopes of providing a clear summary. Python doesn't return values, it returns objects. To that end, a function can only return one object. That object can contain multiple values, or even refer to other objects, so while your function can pass back multiple values to the caller, it must do so inside of a single object. "Objects" are the internal unit of data inside of Python, and are distinctly different from the "value" contained in the object, so it's a good practice in Python to always keep in mind the distinction between the two and how they're used, regardless of how the question is worded.
Your instructor is correct.
https://docs.python.org/3/reference/datamodel.html#objects-values-and-types
Tuples
The items of a tuple are arbitrary Python objects. Tuples of two or more items are formed by comma-separated lists of expressions. A tuple of one item (a ‘singleton’) can be formed by affixing a comma to an expression (an expression by itself does not create a tuple, since parentheses must be usable for grouping of expressions). An empty tuple can be formed by an empty pair of parentheses.
Applied to your example:
foo = "String", 5.5, True, 11
assert isinstance(foo, tuple)
So you're clearly returning a single object.
Here's how the assignment expression is specified when there are multiple targets (variables on the left hand side):
https://docs.python.org/3/reference/simple_stmts.html#assignment-statements
Assignment of an object to a target list, optionally enclosed in parentheses or square brackets, is recursively defined as follows.
If the target list is a single target with no trailing comma, optionally in parentheses, the object is assigned to that target.
Else: The object must be an iterable with the same number of items as there are targets in the target list, and the items are assigned, from left to right, to the corresponding targets.
(All emphasis mine)
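Both quoted rules can be checked directly on the original example: the return statement builds exactly one tuple object, and the multi-target assignment then unpacks that one iterable into four names:

```python
def myFunction():
    return "String", 5.5, True, 11

# The function hands back exactly one object...
result = myFunction()
assert isinstance(result, tuple)
assert len(result) == 4

# ...and multiple-target assignment unpacks that one iterable,
# item by item, left to right:
val1, val2, val3, val4 = myFunction()
assert (val1, val2, val3, val4) == ("String", 5.5, True, 11)
```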
I am encountering a weird situation in Python and would like some advice. For some business reasons, we need to keep this Python code module to the smallest possible number of lines. Long story --- but it comes down to a requirement that this code module is printed out and archived on paper. I didn't make the rules -- I just need to pay the mortgage.
We are reading a lot of data from a mainframe web service and applying some business rules to the data. For example, a "plain English" business rule would be:
If the non-resident state value for field XXXXXX is blank or shorter than two characters [treat as the same], the value for XXXXXX must be set to "NR". Evaluation must treat the value as non-resident unless residence has been explicitly asserted.
I would like to use ternary operators for some of these rules, as they will help condense the lines of code. I have not used ternaries in Python 3 for this type of work, and I am either missing something or formatting the line wrong:
mtvartaxresXXX = "NR" if len(mtvartaxresXXX)< 2
does not work.
This block (classic python) does work
if len(mtvartaxresXXX) < 2:
    mtvartaxresXXX = "NR"
What is the most "pythonish" way to perform this evaluation in a single-line if statement?
Thanks
You can simply write the if statement on a single line:
if len(mtvartaxresXXX) < 2: mtvartaxresXXX = "NR"
This is the same number of lines as the ternary, and doesn't require an explicit else value, so it's fewer characters.
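For comparison, here are both forms with a made-up sample value; note that the conditional-expression (ternary) version needs an explicit else branch just to leave the value unchanged:

```python
mtvartaxresXXX = "N"  # sample value shorter than two characters

# Single-line if statement: assigns only when the condition holds
if len(mtvartaxresXXX) < 2: mtvartaxresXXX = "NR"
assert mtvartaxresXXX == "NR"

# Equivalent conditional expression: the else is mandatory,
# so the no-op case must be spelled out
state = "N"
state = "NR" if len(state) < 2 else state
assert state == "NR"
```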
I'm working on a project to convert MATLAB code to Python, and have been somewhat successful after building off others' work. The tool uses PLY (an implementation of the lex and yacc parsing tools for Python) to parse the MATLAB input. Unfortunately, it is a requirement that my code is written in Python 3, not Python 2. The tool runs without issue in Python 2, but I get a strange error in Python 3 (assuming A is an array):
log_idx = A <= 16;
^
SyntaxError: Unexpected "=" (parser)
The MATLAB code I am trying to convert is:
idx = A <= 16;
which should convert to almost the same thing in Python 3:
idx = A <= 16
The only real difference between the Python 3 code and the Python 2 code is the PLY-generated parsetab.py file, which has substantial differences in the following variables:
_tabversion
_lr_signature
_lr_action_items
_lr_goto_items
I'm having trouble understanding the purpose of these variables and why they could be different when the only difference was the Python version used to generate the parsetab.py file.
I tried searching for documentation on this, but was unsuccessful. I originally suspected it could be a difference in the way strings are formatted between Python 2 and Python 3, but that didn't turn anything up either. Is there anyone familiar with PLY that could give some insight into how these variables are generated, or why the Python version is creating this difference?
Edit: I'm not sure if this would be useful to anyone because the file is very long and cryptic, but below is an example of part of the first lines of _lr_action_items and _lr_goto_items
Python 2:
_lr_action_items = {'DOTDIV':([6,9,14,20,22,24,32,34,36,42,46,47,52,54,56,57,60,71,72,73,74,75 ...
_lr_goto_items = {'lambda_args':([45,80,238,],[99,161,263,]),'unwind':([1,8,28,77,87,160,168,177 ...
Python 3:
_lr_action_items = {'END_STMT':([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,26,27,39,41,48,50 ...
_lr_goto_items = {'top':([0,],[1,]),'stmt':([1,44,46,134,137,207,212,214,215,244,245,250 ...
I'm going to go out on a limb here, because you have provided practically no indication of what code you are actually using. So I'm just going to assume that you copied the lexer.py file from the github repository you linked to in your question.
There's an important clue in this error message:
log_idx = A <= 16;
^
SyntaxError: Unexpected "=" (parser)
Evidently, <= is not being scanned as a single token; otherwise, the parser would not see an = token at that point in the input. This can only mean that the scanner is returning two tokens, < and =, and if that's the case, it is most certainly a syntax error, as you would expect from
log_idx = A < = 16;
To figure out why the lexer would do this, it's important to understand how the Ply (default) lexer works. It gathers up all the lexer patterns from variables whose names start with t_, which must be either functions or variables whose values are strings. It then sorts them as follows:
function docstrings, in order by line number in the source file.
string values, in reverse order by length.
See Specification of Tokens in the Ply manual.
That usually does the right thing, but not always. The intention of sorting in reverse order by length is that a prefix pattern will come after a pattern which matches a longer string. So if you had patterns '<' and '<=', '<=' would be tried first, and so in the case where the input had <=, the < pattern would never be tried. That's important, since if '<' is tried first, '<=' will never be recognised.
However, this simple heuristic does not always work. The fact that a regular expression is shorter does not necessarily mean that its match will be shorter. So if you expect "maximal munch" semantics, you sometimes have to be careful about your patterns. (Or you can supply them as docstrings, because then you have complete control over the order.)
And whoever created that lexer.py file was not careful about their patterns, because it includes (among other issues):
t_LE = r"<="
t_LT = r"\<"
Note that since these are raw strings, the backslash is retained in the second string, so both patterns are of length 2:
>>> len(r"\<")
2
>>> len(r"<=")
2
Since the two patterns have the same length, their relative order in the sort is unspecified. And it is quite possible that the two versions of Python produce different sort orders, either because of differences in the implementation of sort or because of differences in the order which the dictionary of variables is iterated, or some combination of the above.
< has no special significance in a Python regular expression, so there is no need to backslash-escape it in the definition of t_LT. (Clearly, since it is not backslash-escaped in t_LE.) So the simplest solution would be to make the sort order unambiguous by removing the backslash:
t_LE = r"<="
t_LT = r"<"
Now, t_LE is longer and will definitely be tried first.
That's not the only instance of this problem in the lexer file, so you might want to revise it carefully.
Note: You could also fix the problem by adding an unnecessary backslash to the t_LE pattern; there is an argument for taking the attitude, "When in doubt, escape." However, it is useful to know which characters need to be escaped in a Python regex, and the Python documentation for the re package contains a complete list. Also, consider using long raw strings for patterns which include quotes, since neither " nor ' need to be backslash escaped in a Python regex.
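The ordering problem is easy to reproduce with plain re alternation, which (like the master pattern PLY builds from the sorted token patterns) tries alternatives left to right:

```python
import re

# Both raw strings have length 2, so a longest-first sort
# cannot order them reliably:
assert len(r"\<") == 2 and len(r"<=") == 2

# '<' tried first: on input "<=", only '<' is matched and the
# '=' is left over for the parser to choke on.
bad = re.compile(r"<|<=")
assert bad.match("<=").group() == "<"

# '<=' tried first: maximal munch restored.
good = re.compile(r"<=|<")
assert good.match("<=").group() == "<="
```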
I am trying to write a small Python 2.x API to support fetching a job by jobNumber, where jobNumber is provided as an integer.
Sometimes the users provide a jobNumber as an integer literal beginning with 0, e.g. 037537. (This is because they have been coddled by R, a language that sanely considers 037537==37537.)
Python, however, considers integer literals starting with "0" to be OCTAL, thus 037537!=37537; instead 037537==16223. This strikes me as a blatant affront to the principle of least surprise, and thankfully it looks like this was fixed in Python 3---see PEP 3127.
But I'm stuck with Python 2.7 at the moment. So my users do this:
>>> fetchJob(037537)
and silently get the wrong job (16223), or this:
>>> fetchJob(038537)
  File "<stdin>", line 1
    fetchJob(038537)
                 ^
SyntaxError: invalid token
where Python is rejecting the octal-incompatible digit.
There doesn't seem to be anything provided via __future__ to allow me to get the Py3K behavior---it would have to be built into Python in some manner, since it requires a change to the lexer at least.
Is anyone aware of how I could protect my users from getting the wrong job in cases like this? At the moment the best I can think of is to change the API so that it takes a string instead of an int.
At the moment the best I can think of is to change the API so that it takes a string instead of an int.
Yes, and I think this is a reasonable option given the situation.
Another option would be to make sure that all your job numbers contain at least one digit greater than 7 so that adding the leading zero will give an error immediately instead of an incorrect result, but that seems like a bigger hack than using strings.
A final option could be to educate your users. It will only take five minutes or so to explain not to add the leading zero and what can happen if you do. Even if they forget or accidentally add the zero due to old habits, they are more likely to spot the problem if they have heard of it before.
Perhaps you could take the input as a string and convert it to an int? There is no need to strip the leading zeros first: int() applied to a string parses it as base 10 regardless of leading zeros (and note that "0".lstrip("0") would leave an empty string, which int() rejects):
test = "001234505"
test = int(test)  # 1234505
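A minimal sketch of the string-taking API along these lines (fetch_job and its validation are illustrative, not from the original code, and it is written in Python 3 syntax even though the question concerns Python 2):

```python
def fetch_job(job_number):
    # Illustrative wrapper: insist on a string, so callers cannot pass
    # an integer literal that Python 2 would have read as octal.
    if not isinstance(job_number, str):
        raise TypeError("pass the job number as a string, e.g. '037537'")
    return int(job_number, 10)  # base 10: leading zeros are harmless

assert fetch_job("037537") == 37537
assert fetch_job("038537") == 38537  # no octal-digit complaint here
```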
I am trying to find all occurrences of a literal float value in Python code. Can I do that in Komodo (or in any other way)?
In other words, I want to find every line where something like 0.0 or 1.5 or 1e5 is used, assuming it is interpreted by Python as a float literal (so no comments, for example).
I'm using Komodo 6.0 with Python 3.1.
If possible, a way to find string and integer literals would be nice to have as well.
Our SD Source Code Search Engine (SCSE) can easily do this.
SCSE is a tool for searching large source code bases, much faster than grep, by indexing the elements of the source code languages of interest. Queries can then be posed, which use the index to enable fast location of search hits. Queries and hits are displayed in a GUI, and a click on a hit will show the block of source code containing the hit.
The SCSE knows the lexical structure of each language it has indexed with the same precision as that language's compiler. (It uses front ends from a family of accurate programming language processors; this family is pretty large and happens to include the OP's target language of Python/Perl/Java/....) Thus it knows exactly where identifiers, comments, and literals (integral, float, character or string) are, and exactly their content.
SCSE queries are composed of commands representing sequences of language elements of interest. The query
'for' ... I '=' N=103
finds a for keyword near ("...") an arbitrary identifier (I) which is initialized ("=") with the numeric value ("N") of 103. Because SCSE understands the language structure, it ignores language whitespace between the tokens, e.g., it can find this regardless of intervening blanks, tabs, newlines or comments.
The query tokens I, N, F, S, C represent I(dentifier), Natural (number), F(loat), S(tring) and C(omment) respectively. The OP's original question, of finding all the floats, is thus the nearly trivial query
F
Similarly for finding all String literals ("S") and integral literals ("N"). If you wanted to find just copies of values near Pi, you'd add low and upper bound constraints:
F>3.14<3.16
(It is pretty funny to run this on large Fortran codes; you see all kinds of bad approximations of Pi).
SCSE won't find a float in a comment or a string, because it intimately knows the difference. Writing a grep-style expression to handle all the strange combinations needed to eliminate whitespace, surrounding quotes and comment delimiters is obviously a lot more painful. Grep ain't the way to do this.
You could do that by selecting what you need with regular expressions.
This command (run it on a terminal) should do the trick:
sed -r "s/^([^#]*)#.*$/\1/g" YOUR_FILE | grep -P "[^'\"\w]-?[1-9]\d*[.e]\d*[^'\"\w]"
You'll probably need to tweak it to get a better result.
`sed' cuts out comments, while grep selects only lines containing (a small subset of --- the expression I gave is not perfect) float values...
Hope it helps.
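If Komodo's search falls short, Python's own tokenize module can do this exactly, since it knows the difference between a NUMBER token and digits inside a string or comment. A small sketch (the dot/exponent test is a simplification and would also flag hex literals containing e or E):

```python
import io
import tokenize

source = (
    'x = 1.5  # 2.5 only appears in this comment\n'
    's = "0.0 inside a string"\n'
    'n = 1e5\n'
    'k = 42\n'
)

floats = []
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    # NUMBER tokens exclude comments and strings by construction;
    # a '.' or an exponent marker distinguishes floats from ints.
    if tok.type == tokenize.NUMBER and any(c in tok.string for c in '.eE'):
        floats.append((tok.start[0], tok.string))

assert floats == [(1, '1.5'), (3, '1e5')]
```

The same loop with tokenize.STRING or tokenize.NUMBER alone would cover the string-literal and integer-literal searches as well.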