QUESTIONS
This is a long post, so I will highlight my main two questions now before giving details:
How can one succinctly allow for optional matched parentheses/brackets around an expression?
How does one properly parse the content of nested_expr? This answer suggests that this function is not quite appropriate for this, and infix_notation is better, but that doesn't seem to fit my use case (I don't think).
DETAILS
I am working on a grammar to parse prolog strings. The data I have involves a lot of optional brackets or parentheses.
For example, both predicate([arg1, arg2, arg3]) and predicate(arg1, arg2, arg3) are legal and appear in the data.
My full grammar is a little complicated, and likely could be cleaned up, but I will paste it here for reproducibility. I have a couple versions of the grammar as I found new data that I had to account for. The first one works with the following example string:
pred(Var, arg_name1:arg#arg_type, arg_name2:(sub_arg1, sub_arg2))
For some visual clarity, I am turning the parsed strings into graphs, so this is what this one should look like:
Note that arg_name2:(sub_arg1, sub_arg2) is slightly idiosyncratic syntax: the items inside the parens are meant to be read as having an AND operator between them. The only thing indicating this is that the wrapped expression appears "naked" (i.e. it has no predicate name of its own; it's just some values lumped together in parens).
VERSION 1: works on the above string
# GRAMMAR VER 1
predication = pp.Forward()
join_predication = pp.Forward()
entity = pp.Forward()
args_list = pp.Forward()
# atoms are used either as predicate names or bottom level argument values
# str_atoms are just quoted strings which may also appear as arguments
atom = pp.Word(pp.alphanums + '_' + '.')
str_atom = pp.QuotedString("'")
# TYPICAL ARGUMENT: arg_name:ARG_VALUE, where the ARG_VALUE may be an entity, join_predication, predication, or just an atom.
# Note that the arg_name is optional and may not always appear
# EXAMPLES:
# with name: pred(arg1:val1, arg2:val2)
# without name: pred(val1, val2)
argument = pp.Group(pp.Opt(atom("arg_name") + pp.Suppress(":")) + (entity | join_predication | predication | atom("arg_value") | str_atom("arg_value")))
# List of arguments
args_list = pp.Opt(pp.Suppress("[")) + pp.delimitedList(argument) + pp.Opt(pp.Suppress("]"))
# As in the example string above, sometimes predications are grouped together in parentheses and are meant to be understood as having an AND operator between them when evaluating the truth of both together
# EXAMPLE: pred(arg1:(sub_pred1, subpred2))
# I am just treating it as an args_list inside obligatory parentheses
join_predication <<= pp.Group(pp.Suppress("(") + args_list("args_list") + pp.Suppress(")"))("join_predication")
# pred_name with optional arguments (though I've never seen one without arguments, just in case)
predication <<= pp.Group(atom("pred_name") + pp.Suppress("(") + pp.Opt(args_list)("args_list") + pp.Suppress(")"))("predication")
# ent_name with optional arguments and a #type
entity <<= (pp.Group(((atom("ent_name")
+ pp.Suppress("(") + pp.Opt(args_list)("args_list") + pp.Suppress(")"))
| str_atom("ent_name") | atom("ent_name"))
+ pp.Suppress("#") + atom("type"))("entity"))
# starter symbol
lf_fragment = entity | join_predication | predication
Although this works, I came across another very similar string which used brackets instead of parentheses for a join_predication:
pred(Var, arg_name1:arg#arg_type, arg_name2:[sub_arg1, sub_arg2])
This broke my parser, seemingly because brackets are also used in other places and are often optional, so a bracket could be matched by the wrong parser element; nothing in my grammar enforces that an opening bracket and its closing bracket go together. For this I thought to turn to nested_expr, but that caused further problems: as mentioned in this answer, parsing the elements inside a nested_expr doesn't work very well, and I lost a lot of the substructure I need for the graphs I'm building.
VERSION 2: using nested_expr
# only including those expressions that have been changed
# args_list might not have brackets
args_list = pp.nested_expr("[", "]", pp.delimitedList(argument)) | pp.delimitedList(argument)
# join_predication is an args_list with obligatory wrapping parens/brackets
join_predication <<= pp.nested_expr("(", ")", args_list("args_list"))("join_predication") | pp.nested_expr("[", "]", args_list("args_list"))("join_predication")
I likely need to ensure matching for predication and entity, but haven't for now.
Using the above grammar, I can parse both example strings, but I lose the named structure that I had before.
In the original grammar, parse_results['predication']['args_list'] was a list of every argument, exactly as I expected. In the new grammar, it only contains the first argument, Var, in the example strings.
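The pairing requirement described above (an opening bracket must only ever be closed by its own partner) can be illustrated independently of pyparsing. The following is a minimal plain-Python sketch, not a pyparsing solution; the names PAIRS and strip_matched_wrapper are hypothetical, and it assumes that brackets never occur inside quoted strings.

```python
# Hypothetical helper for illustration; assumes no brackets inside quoted strings.
PAIRS = {"(": ")", "[": "]"}

def strip_matched_wrapper(text):
    """Remove one pair of outer delimiters, but only if the first character's
    partner is the last character of the text."""
    text = text.strip()
    if not text or text[0] not in PAIRS or text[-1] != PAIRS[text[0]]:
        return text  # no wrapper, or opener and closer are different kinds
    stack = []
    for i, ch in enumerate(text):
        if ch in PAIRS:
            stack.append(ch)
        elif ch in PAIRS.values():
            if not stack or PAIRS[stack.pop()] != ch:
                raise ValueError(f"mismatched {ch!r} at position {i}")
            if not stack and i != len(text) - 1:
                return text  # the first opener closed early, so it is not a wrapper
    if stack:
        raise ValueError("unclosed opening delimiter")
    return text[1:-1].strip()

print(strip_matched_wrapper("[arg1, arg2, arg3]"))
print(strip_matched_wrapper("(sub_arg1, sub_arg2)"))
print(strip_matched_wrapper("(a), (b)"))
```

The stack is what ties each closer to its own opener; a grammar that makes both delimiters independently optional has no equivalent of it.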
I'm trying to match type annotations like int | str, and use regex substitution to replace them with a string Union[int, str].
Desired substitutions (before and after):
str|int|bool -> Union[str,int,bool]
Optional[int|tuple[str|int]] -> Optional[Union[int,tuple[Union[str,int]]]]
dict[str | int, list[B | C | Optional[D]]] -> dict[Union[str,int], list[Union[B,C,Optional[D]]]]
The regular expression I've come up with so far is as follows:
r"\w*(?:\[|,|^)[\t ]*((?'type'[a-zA-Z0-9_.\[\]]+)(?:[\t ]*\|[\t ]*(?&type))+)(?:\]|,|$)"
You can try it out here on Regex Demo. It's not really working how I'd want it to. The problems I've noted so far:
It doesn't seem to handle nested Union conditions so far. For example, int | tuple[str|int] | bool seems to result in one match, rather than two matches (including the inner Union condition).
The regex seems to consume unnecessary ] at the end.
Probably the most important one, but I noticed the regex subroutines don't seem to be supported by the re module in Python. Here is where I got the idea to use that from.
Additional Info
This is mainly to support the PEP 604 syntax on Python 3.7+, which requires annotations to be forward-declared (e.g. declared as strings), as otherwise builtin types don't support the | operator.
Here's a sample code that I came up with:
from __future__ import annotations
import datetime
from decimal import Decimal
from typing import Optional
class A:
    field_1: str|int|bool
    field_2: int | tuple[str|int] | bool
    field_3: Decimal|datetime.date|str
    field_4: str|Optional[int]
    field_5: Optional[int|str]
    field_6: dict[str | int, list[B | C | Optional[D]]]
class B: ...
class C: ...
class D: ...
For Python versions earlier than 3.10, I use a __future__ import to avoid the error below:
TypeError: unsupported operand type(s) for |: 'type' and 'type'
This essentially converts all annotations to strings, as below:
>>> A.__annotations__
{'field_1': 'str | int | bool', 'field_2': 'int | tuple[str | int] | bool', 'field_3': 'Decimal | datetime.date | str', 'field_4': 'str | Optional[int]', 'field_5': 'Optional[int | str]', 'field_6': 'dict[str | int, list[B | C | Optional[D]]]'}
But in code (say in another module), I want to evaluate the annotations in A. This works in Python 3.10, but fails in Python 3.7 through 3.9, even though the __future__ import supports forward-declared annotations.
>>> from typing import get_type_hints
>>> hints = get_type_hints(A)
Traceback (most recent call last):
eval(self.__forward_code__, globalns, localns),
File "<string>", line 1, in <module>
TypeError: unsupported operand type(s) for |: 'type' and 'type'
It seems the best approach to make this work is to replace all occurrences of int | str (for example) with Union[int, str]. Then, with typing.Union included in the additional localns used to evaluate the annotations, it should be possible to evaluate PEP 604-style annotations on Python 3.7+.
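As a rough sketch of the second half of that plan, the snippet below evaluates string annotations that are already in Union[...] form, passing the typing names through localns. The class Example and the dict extra_names are made up for illustration; the rewriting step itself is not shown.

```python
from typing import Optional, Union, get_type_hints

class Example:
    # String annotations as they would look after '|' has been rewritten to Union[...].
    field_1: 'Union[str, int, bool]'
    field_5: 'Optional[Union[int, str]]'

# Names the annotation strings refer to, supplied explicitly in case the
# defining module does not import them itself.
extra_names = {'Union': Union, 'Optional': Optional}

hints = get_type_hints(Example, localns=extra_names)
print(hints)
```

Because the annotations are plain strings here, nothing is evaluated until get_type_hints is called, so the same code runs on versions that predate the | operator for types.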
You can install the PyPI regex module (as re does not support recursion) and use
import regex
text = "str|int|bool\nOptional[int|tuple[str|int]]\ndict[str | int, list[B | C | Optional[D]]]"
rx = r"(\w+\[)(\w+(\[(?:[^][|]++|(?3))*])?(?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+)]"
n = 1
res = text
while n != 0:
    res, n = regex.subn(rx, lambda x: "{}Union[{}]]".format(x.group(1), regex.sub(r'\s*\|\s*', ',', x.group(2))), res)
print( regex.sub(r'\w+(?:\s*\|\s*\w+)+', lambda z: "Union[{}]".format(regex.sub(r'\s*\|\s*', ',', z.group())), res) )
Output:
Union[str,int,bool]
Optional[Union[int,tuple[Union[str,int]]]]
dict[Union[str,int], list[Union[B,C,Optional[D]]]]
See the Python demo.
The first regex finds all kinds of WORD[...] that contain pipe chars and other WORDs or WORD[...] with no pipe chars inside them.
The \w+(?:\s*\|\s*\w+)+ regex matches 2 or more words that are separated with pipes and optional spaces.
The first pattern details:
(\w+\[) - Group 1 (this will be kept as is at the beginning of the replacement): one or more word chars and then a [ char
(\w+(\[(?:[^][|]++|(?3))*])?(?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+) - Group 2 (it will be put inside Union[...] with all \s*\|\s* pattern replaced with ,):
\w+ - one or more word chars
(\[(?:[^][|]++|(?3))*])? - an optional Group 3 that matches a [ char, followed by zero or more occurrences of either one or more chars other than [, ] and |, or the whole Group 3 recursed (hence, it matches nested brackets), and then a ] char
(?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+ - one or more occurrences (so the match contains at least one pipe char to replace with ,) of:
\s*\|\s* - a pipe char enclosed with zero or more whitespaces
\w+ - one or more word chars
(\[(?:[^][|]++|(?4))*])? - an optional Group 4 (matches the same thing as Group 3, note the (?4) subroutine repeats Group 4 pattern)
] - a ] char.
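The second, simpler pattern uses no recursion or possessive quantifiers, so it can be tried with the standard re module on its own. This is only a sketch of that final pass (the names FLAT_UNION, PIPE and replace_flat_unions are made up here); it does not replace the recursive first pattern.

```python
import re

# Two or more words separated by pipes and optional whitespace.
FLAT_UNION = re.compile(r'\w+(?:\s*\|\s*\w+)+')
PIPE = re.compile(r'\s*\|\s*')

def replace_flat_unions(text):
    """Wrap each flat `A | B | C` run in Union[...], turning pipes into commas."""
    return FLAT_UNION.sub(
        lambda m: 'Union[{}]'.format(PIPE.sub(',', m.group())), text)

print(replace_flat_unions('str|int|bool'))
print(replace_flat_unions('list[B | C]'))
```

Note that \w+ does not match a dot, so a dotted name such as datetime.date is not treated as a single word by this pattern.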
Just an update: I was at last able to come up with a (fully working) non-regex approach to this problem. It took me this long because it required some careful thought; it took two days of intermittent work to piece it all together and to fully wrap my head around what I was trying to accomplish.
The regex solution put forth by @Wiktor is the accepted answer for now, and it works really well in general. Going back later, I found only a few edge cases that it wasn't able to handle, which I go over here. However, there were a few reasons I wondered whether a non-regex solution would be the better choice:
My actual use case is that I'm building a library (package), so I want to cut down on dependencies where possible. The regex module is an external dependency, and not a negligible one in size; in my case, I would probably need to add it as an optional extra for my library.
The regex matching is not as fast as I had hoped. Don't get me wrong, it's still fast for the complex use cases mentioned in the post (about 1-3ms on average), but with many annotations on a class this would quickly add up. I therefore suspected that a non-regex approach would be faster, and was curious to test that.
Therefore, I am posting the non-regex implementation that I was able to cobble together below. It solves my original problem of converting Union type annotations such as X|Y into annotations like Union[X, Y], and also supports more complex use cases that the regex implementation does not account for. I still prefer the regex version, as I believe it is vastly simpler than this, and for the majority of cases I believe it will work without issue.
However, note this is the first and only non-regex implementation I have been able to put together for this specific problem. And without further ado, here goes:
from typing import Iterable, Dict, List
# Constants
OPEN_BRACKET = '['
CLOSE_BRACKET = ']'
COMMA = ','
OR = '|'
def repl_or_with_union(s: str):
    """
    Replace all occurrences of PEP 604- style annotations (i.e. like `X | Y`)
    with the Union type from the `typing` module, i.e. like `Union[X, Y]`.
    This is a recursive function that splits a complex annotation in order to
    traverse and parse it, i.e. one that is declared as follows:
      dict[str | Optional[int], list[list[str] | tuple[int | bool] | None]]
    """
    return _repl_or_with_union_inner(s.replace(' ', ''))
def _repl_or_with_union_inner(s: str):
    # If there is no '|' character in the annotation part, we just return it.
    if OR not in s:
        return s
    # Checking for brackets like `List[int | str]`.
    if OPEN_BRACKET in s:
        # Get any indices of COMMA or OR outside a braced expression.
        indices = _outer_comma_and_pipe_indices(s)
        outer_commas = indices[COMMA]
        outer_pipes = indices[OR]
        # We need to check if there are any commas *outside* a bracketed
        # expression. For example, the following cases are what we're looking
        # for here:
        #     value[test], dict[str | int, tuple[bool, str]]
        #     dict[str | int, str], value[test]
        # But we want to ignore cases like these, where all commas are nested
        # within a bracketed expression:
        #     dict[str | int, Union[int, str]]
        if outer_commas:
            return COMMA.join(
                [_repl_or_with_union_inner(i)
                 for i in _sub_strings(s, outer_commas)])
        # We need to check if there are any pipes *outside* a bracketed
        # expression. For example:
        #     value | dict[str | int, list[int | str]]
        #     dict[str, tuple[int | str]] | value
        # But we want to ignore cases like these, where all pipes are
        # nested within a bracketed expression:
        #     dict[str | int, list[int | str]]
        if outer_pipes:
            or_parts = [_repl_or_with_union_inner(i)
                        for i in _sub_strings(s, outer_pipes)]
            return f'Union{OPEN_BRACKET}{COMMA.join(or_parts)}{CLOSE_BRACKET}'
        # At this point, we know that the annotation does not have an outer
        # COMMA or PIPE expression. We also know that the following syntax
        # is invalid: `SomeType[str][bool]`. Therefore, knowing this, we can
        # assume there is only one outer start and end brace. For example,
        # like `SomeType[str | int, list[dict[str, int | bool]]]`.
        first_start_bracket = s.index(OPEN_BRACKET)
        last_end_bracket = s.rindex(CLOSE_BRACKET)
        # Replace the value enclosed in the outermost brackets
        bracketed_val = _repl_or_with_union_inner(
            s[first_start_bracket + 1:last_end_bracket])
        start_val = s[:first_start_bracket]
        end_val = s[last_end_bracket + 1:]
        return f'{start_val}{OPEN_BRACKET}{bracketed_val}{CLOSE_BRACKET}{end_val}'
    elif COMMA in s:
        # We are dealing with a string like `int | str, float | None`
        return COMMA.join([_repl_or_with_union_inner(i)
                           for i in s.split(COMMA)])
    # We are dealing with a string like `int | str`
    return f'Union{OPEN_BRACKET}{s.replace(OR, COMMA)}{CLOSE_BRACKET}'
def _sub_strings(s: str, split_indices: Iterable[int]):
    """Split a string on the specified indices, and return the split parts."""
    prev = -1
    for idx in split_indices:
        yield s[prev+1:idx]
        prev = idx
    yield s[prev+1:]
def _outer_comma_and_pipe_indices(s: str) -> Dict[str, List[int]]:
    """Return any indices of ',' and '|' that are outside of braces."""
    indices = {OR: [], COMMA: []}
    brace_dict = {OPEN_BRACKET: 1, CLOSE_BRACKET: -1}
    brace_count = 0
    for i, char in enumerate(s):
        if char in brace_dict:
            brace_count += brace_dict[char]
        elif not brace_count and char in indices:
            indices[char].append(i)
    return indices
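The bracket-depth scan that _outer_comma_and_pipe_indices relies on may be easier to follow in isolation. Below is a stripped-down sketch of the same idea: a counter goes up on '[' and down on ']', and a separator only counts when the counter is zero. The function split_top_level is a simplified stand-in written for this illustration, not part of the implementation above.

```python
def split_top_level(text, separator):
    """Split `text` on `separator`, ignoring separators nested inside [...]."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
        elif ch == separator and depth == 0:
            # Only a separator at depth zero is an "outer" one.
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return parts

print(split_top_level('dict[str|int,str]|value', '|'))
print(split_top_level('dict[str|int,list[int|str]]', '|'))
```

When the second call returns a single part, all pipes are nested, which is the case where the implementation above recurses into the outermost brackets instead.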
I've tested it against the common use cases listed in the question above, as well as more complex use cases that even the regex implementation seemed to wrestle with.
For example, given these sample test cases:
test_cases = """
str|int|bool
Optional[int|tuple[str|int]]
dict[str | int, list[B | C | Optional[D]]]
dict[str | Optional[int], list[list[str] | tuple[int | bool] | None]]
tuple[str|OtherType[a,b|c,d], ...] | SomeType[str | int, list[dict[str, int | bool]]] | dict[str | int, str]
"""
for line in test_cases.strip().split('\n'):
    print(repl_or_with_union(line).replace(',', ', '))
Then the result is as below (note that I've replaced each "," with ", " so it's a bit easier to read)
Union[str, int, bool]
Optional[Union[int, tuple[Union[str, int]]]]
dict[Union[str, int], list[Union[B, C, Optional[D]]]]
dict[Union[str, Optional[int]], list[Union[list[str], tuple[Union[int, bool]], None]]]
Union[tuple[Union[str, OtherType[a, Union[b, c], d]], ...], SomeType[Union[str, int], list[dict[str, Union[int, bool]]]], dict[Union[str, int], str]]
Now the only ones that the regex implementation wasn't able to correctly parse were the last two cases, which are arguably pretty complex to begin with. Here are the regex solutions for the last two - which unfortunately aren't how we'd want them (again, I've ensured there's a space after each comma so it's a bit easier to read)
dict[Union[str, Optional][int], list[Union[list[str], tuple[Union[int, bool]], None]]]
tuple[Union[str, OtherType][a, Union[b, c], d], ...] | SomeType[Union[str, int], list[dict[str, Union[int, bool]]]] | dict[Union[str, int], str]
Maybe it's worth going over why those cases weren't handled as expected by the regex version. My suspicion, which I confirmed by testing, is that any operand of a | expression that contains brackets [] is not parsed correctly. For example, str | Optional[int] currently comes out as Union[str,Optional][int], whereas it should come out as Union[str,Optional[int]].
I've boiled down the two test cases above to abbreviated forms below, for which I was able to confirm that the regex didn't handle as expected:
str | Optional[int]
tuple[str|OtherType[a,b|c,d], ...] | SomeType[str]
When parsing via the regex implementation, these are the current results. Note that in one of the results, the | character also appears, however ideally we would strip that out as Python versions earlier than 3.10 wouldn't be able to evaluate a pipe | expression against builtin types.
Union[str,Optional][int]
tuple[Union[str,OtherType][a,Union[b,c],d], ...] | SomeType[str]
The desired end result (that the non-regex approach seems to resolve as expected, after I fixed it to handle such cases when testing) is as follows:
Union[str, Optional[int]]
Union[tuple[Union[str,OtherType[a,Union[b,c],d]], ...], SomeType[str]]
Lastly, I've also been able to time it against the regex approach above. I was myself curious how this solution would fare against the regex version, which is arguably much simpler and easier to understand.
The code I tested with is given below:
import regex
from timeit import timeit
# Assumes repl_or_with_union from the non-regex implementation above is already defined.
def regex_repl_or_with_union(text):
    rx = r"(\w+\[)(\w+(\[(?:[^][|]++|(?3))*])?(?:\s*\|\s*\w+(\[(?:[^][|]++|(?4))*])?)+)]"
    n = 1
    res = text
    while n != 0:
        res, n = regex.subn(rx, lambda x: "{}Union[{}]]".format(x.group(1), regex.sub(r'\s*\|\s*', ',', x.group(2))),
                            res)
    return regex.sub(r'\w+(?:\s*\|\s*\w+)+', lambda z: "Union[{}]".format(regex.sub(r'\s*\|\s*', ',', z.group())), res)
test_cases = """
str|int|bool
Optional[int|tuple[str|int]]
dict[str | int, list[B | C | Optional[D]]]
"""
def non_regex_solution():
    for line in test_cases.strip().split('\n'):
        _ = repl_or_with_union(line)
def regex_solution():
    for line in test_cases.strip().split('\n'):
        _ = regex_repl_or_with_union(line)
n = 100_000
print('Non-regex: ', timeit('non_regex_solution()', globals=globals(), number=n))
print('Regex: ', timeit('regex_solution()', globals=globals(), number=n))
The results - run on an Alienware PC, AMD Ryzen 7 3700X 8-core processor /w 16GB memory:
Non-regex: 2.0510589000186883
Regex: 31.39290289999917
So, the non-regex implementation turned out to be about 15x faster on average than the regex implementation, which was hard to believe. The best news to me is that it doesn't involve additional dependencies. I will likely move forward with the non-regex solution for now, mainly because I would like to cut down on project dependencies where possible. Many thanks again to @Wiktor and all those who helped out with this problem and helped steer me towards a solution!
I'm trying to write PyParsing code capable of parsing any Python code (I know that the AST module exists, but that will just be a starting point - I ultimately want to parse more than just Python code.)
Anyways, I figure I'll just start by writing something able to parse the classic
print("Hello World!")
So here's what I wrote:
from pyparsing import (alphanums, alphas, delimitedList, Forward, Optional,
                       quotedString, removeQuotes, Suppress, Word)
expr = Forward()
string = quotedString.setParseAction(removeQuotes)
call = expr + Suppress('(') + Optional(delimitedList(expr)) + Suppress(')')
name = Word(alphas + '_', alphanums + '_')
expr <<= string | name | call
test = 'print("Hello World!")'
print(expr.parseString(test))
When I do that, though, it just spits out:
['print']
Which is technically a valid expr - you can type that into the REPL and there's no problem parsing it, even if it's useless.
So I thought maybe what I would want is to flip around name and call in my expr definition, so it would prefer returning calls to names, like this:
expr <<= string | call | name
Now I get a maximum recursion depth exceeded error. That makes sense, too:
Checks if it's an expr.
Checks if it's a string, it's not.
Checks if it's a call.
It must start with an expr, return to start of outer list.
So my question is... how can I define call and expr so that I don't end up with an infinite recursion, but also so that it doesn't just stop when it sees the name and ignore the arguments?
Is Python code too complicated for PyParsing to handle? If not, is there any limit to what PyParsing can handle?
(Note - I've included the general tags parsing, abstract-syntax-tree, and bnf, because I suspect this is a general recursive grammar definition problem, not something necessarily specific to pyparsing.)
Your grammar is left recursive: expr expects a call, which expects an expr, which expects a call... If PyParsing can't handle left recursion, you need to change the grammar into something that PyParsing can work with.
One approach to removing direct left recursion is to change a grammar rule such as:
A = A b | c
into
A = c b*
In your case, left recursion is indirect: it doesn't happen in expr, but in a sub rule (call):
E = C | s | n
C = E x y z
To remove indirect left recursion you usually "lift" the definition of the sub-rule into the main rule. Unfortunately this removes the offending sub-rule from the grammar -- in other words, you lose some structural expressiveness when you do that.
The previous example, with indirect recursion removed, would look like this:
E = E x y z | s | n
At this point, you have direct left recursion, which is easier to transform. When you deal with that, the result would be something like this -- in pseudo EBNF:
E = (s | n) (x y z)*
In your case, the definition of Expr would become:
Expr = (string | name) Args*
Args = "(" ExprList? ")"
ExprList = Expr ("," Expr)*
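To make the transformed grammar concrete, here is a small hand-written parser for it in plain Python. It is a sketch of the rewritten rules rather than PyParsing code: the tokenizer and the names TOKEN, tokenize, parse_expr and parse_args are invented for this illustration. The point to notice is that Args* becomes a while loop, so parse_expr never calls itself before consuming a token.

```python
import re

# Hypothetical tokenizer: names, double-quoted strings, and the punctuation ( ) ,
TOKEN = re.compile(r'\s*(?:(?P<name>[A-Za-z_]\w*)|(?P<string>"[^"]*")|(?P<punct>[(),]))')

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def parse_expr(tokens, i=0):
    """Expr = (string | name) Args*   (a loop replaces the left recursion)."""
    if i >= len(tokens) or tokens[i][0] not in ("name", "string"):
        raise SyntaxError("expected a name or a string")
    kind, value = tokens[i]
    node = ("str", value[1:-1]) if kind == "string" else ("name", value)
    i += 1
    # Each trailing "(...)" wraps everything parsed so far in a call node.
    while i < len(tokens) and tokens[i] == ("punct", "("):
        args, i = parse_args(tokens, i)
        node = ("call", node, args)
    return node, i

def parse_args(tokens, i):
    """Args = "(" ExprList? ")"   and   ExprList = Expr ("," Expr)*"""
    i += 1  # skip the "("
    args = []
    if i < len(tokens) and tokens[i] != ("punct", ")"):
        arg, i = parse_expr(tokens, i)
        args.append(arg)
        while i < len(tokens) and tokens[i] == ("punct", ","):
            arg, i = parse_expr(tokens, i + 1)
            args.append(arg)
    if i >= len(tokens) or tokens[i] != ("punct", ")"):
        raise SyntaxError('expected ")"')
    return args, i + 1

tree, _ = parse_expr(tokenize('print("Hello World!")'))
print(tree)
```

Chained calls such as f(a)(b, c) nest to the left in the resulting tree, which is the structure the original left-recursive call rule was trying to express.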
I'm trying to implement a forward pipe functionality, like bash's | or R's recent %>%. I've seen this implementation https://mdk.fr/blog/pipe-infix-syntax-for-python.html, but this requires that we define in advance all the functions that might work with the pipe. In going for something completely general, here's what I've thought of so far.
This function applies its first argument to its second (a function)
def function_application(a,b):
    return b(a)
So for example, if we have a squaring function
def sq(s):
    return s**2
we could invoke that function in this cumbersome way function_application(5,sq). To get a step closer to a forward pipe, we want to use function_application with infix notation.
Drawing from this, we can define an Infix class so we can wrap functions in special characters such as |.
class Infix:
    def __init__(self, function):
        self.function = function
    def __ror__(self, other):
        return Infix(lambda x, self=self, other=other: self.function(other, x))
    def __or__(self, other):
        return self.function(other)
Now we can define our pipe which is simply the infix version of the function function_application,
p = Infix(function_application)
So we can do things like this
5 |p| sq
25
or
[1,2,3,8] |p| sum |p| sq
196
After that long-winded explanation, my question is if there is any way to override the limitations on valid function names. Here, I've named the pipe p, but is it possible to overload a non-alphanumeric character? Can I name a function > so my pipe is |>|?
Quick answer:
You can't really use |>| in Python. At the bare minimum you need | * > * |, where each * is an identifier, number, string, or another expression.
Long answer:
Every line is a statement (simple or compound), and a statement can be a couple of things, among them an expression. An expression is the only construct that allows the or operator | and the greater-than comparison > (or any operator or comparison, for that matter: < > <= >= | ^ & >> << - + % / //). Every such expression needs a left-hand side and a right-hand side, ultimately taking the form lhs op rhs. Either side can be another expression, but the exit case is a primary (the exceptions are unary -, ~ and +, which need just a rhs). A primary boils down to an identifier, number or string, so at the end of the day you are required to have an identifier [a-zA-Z_][a-zA-Z_0-9]* alongside a |.
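This can be checked without running anything, by asking Python to compile the candidate expressions. A small sketch (the helper name is_valid_expression is made up here); compiling only checks the syntax, so the names p, sq, a and b do not need to exist.

```python
def is_valid_expression(source):
    """Return True if `source` parses as a single Python expression."""
    try:
        compile(source, "<sketch>", "eval")
    except SyntaxError:
        return False
    return True

print(is_valid_expression("5 |>| sq"))        # operators with nothing between them
print(is_valid_expression("5 |p| sq"))        # an identifier between the pipes
print(is_valid_expression("5 | a > b | sq"))  # the "| * > * |" minimum
```

Only the first of the three is rejected, since it is the only one where an operator is missing an operand.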
Have you considered a different approach, like a single class that overrides the or operator instead of an Infix class? I have a tiny library that does piping, which might interest you.
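One possible reading of that suggestion, sketched under assumptions: wrap the starting value once in a class that defines __or__, so every following | applies a plain function. The class name Piped is hypothetical, and this is not the API of the library mentioned above.

```python
class Piped:
    """Wraps a value so that `Piped(x) | f | g` computes g(f(x))."""
    def __init__(self, value):
        self.value = value

    def __or__(self, func):
        # Apply the function on the right and keep the result wrapped,
        # so the chain can continue with further pipes.
        return Piped(func(self.value))

def sq(s):
    return s ** 2

result = Piped([1, 2, 3, 8]) | sum | sq
print(result.value)
```

The trade-off compared with the Infix class is that the value has to be wrapped at the start and unwrapped at the end, but any callable can appear in the chain without being prepared in advance.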
For reference, here is the full grammar:
https://docs.python.org/2/reference/grammar.html
I was looking for a way to do this too. So I created a Python library called Pypework.
You just add a prefix such as f. to the beginning of each function call to make it pipeable. Then you can chain them together using the >> operator, like so:
"Lorem Ipsum" >> f.lowercase >> f.replace(" ", "_") # -> "lorem_ipsum"
Or across multiple lines if wrapped in parentheses, like so:
(
"Lorem Ipsum"
>> f.lowercase
>> f.replace(" ", "_")
)
# -> "lorem_ipsum"