Function Parser with RegEx in Python

I have some source code in Fortran (the language is almost irrelevant) and I want to parse the function names and arguments.
e.g. using
(\w+)\([^\(\)]+\)
with
a(b(1 + 2 * 2), c(3,4))
I get the following (as expected):
b, 1 + 2 * 2
c, 3,4
where I would need
a, b(1 + 2 * 2), c(3,4)
b, 1 + 2 * 2
c, 3,4
Any suggestions?
Thanks for your time...

It can be done with regular expressions -- use them to tokenize the string and work with the tokens; see, for example, re.Scanner. Alternatively, just use pyparsing.
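For instance, a rough tokenizing sketch using re.Scanner (an undocumented but long-standing helper in the standard re module; the token names here are mine, not anything standard):

import re

scanner = re.Scanner([
    (r"\s+", None),                              # skip whitespace
    (r"[A-Za-z_]\w*", lambda s, t: ("NAME", t)),
    (r"\(", lambda s, t: ("LPAREN", t)),
    (r"\)", lambda s, t: ("RPAREN", t)),
    (r",", lambda s, t: ("COMMA", t)),
    (r"[^\s(),]+", lambda s, t: ("OTHER", t)),   # numbers, operators, ...
])

tokens, remainder = scanner.scan("a(b(1 + 2 * 2), c(3,4))")
# tokens is a flat list like [("NAME", "a"), ("LPAREN", "("), ("NAME", "b"), ...]
# and a small recursive walk over it can pair each NAME with its argument tokens.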

This is not a regular grammar -- you need to be able to recurse on a set of allowed rules. Look at pyparsing to do simple CFG (Context-Free Grammar) parsing via readable specifications.
It's been a while since I've written out CFGs, and I'm probably rusty, so I'll refer you to the Python EBNF to get an idea of how you can construct one for a subset of a language syntax.
Edit: If the example will always be simple, you can code a small state machine class/function that iterates over the tokenized input string, as @Devin Jeanpierre suggests.
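For example, a small pyparsing grammar for nested calls might look roughly like this (a sketch only; the rule names are mine, and infixNotation is the newer name for operatorPrecedence):

import pyparsing as pp

expr = pp.Forward()
name = pp.Word(pp.alphas + "_", pp.alphanums + "_")
call = pp.Group(name("func")
                + pp.Suppress("(")
                + pp.Group(pp.Optional(pp.delimitedList(expr)))("args")
                + pp.Suppress(")"))
operand = call | pp.pyparsing_common.number | name
expr <<= pp.infixNotation(operand, [
    (pp.oneOf("* /"), 2, pp.opAssoc.LEFT),
    (pp.oneOf("+ -"), 2, pp.opAssoc.LEFT),
])

# prints a nested parse tree in which each call is grouped as [name, [arguments]]
print(call.parseString("a(b(1 + 2 * 2), c(3,4))").asList())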

You can take a look at PLY (Python Lex-Yacc), it's (in my opinion) very simple to use and well documented, and it comes with a calculator example which could be a good starting point.

I don't think this is a job for regular expressions... they can't really handle nested patterns.
This is because regexes are compiled into FSMs (Finite State Machines). To parse arbitrarily nested expressions you can't use an FSM, because you would need infinitely many states to keep track of the arbitrary nesting. Also see this SO thread.

You can't do this with regular expressions alone; the problem is recursive. First match the outermost function and its arguments, print the name of the function, then do the same (match the function name, then its arguments) for each of its arguments. Regexes alone are not enough.
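A rough sketch of that idea (my own illustration): a regex finds each name followed by "(", and a small parenthesis counter finds the matching ")", so nested calls are reported along with the outermost one.

import re

CALL_RE = re.compile(r"(\w+)\(")

def list_calls(expr):
    for m in CALL_RE.finditer(expr):
        depth, i = 1, m.end()
        while depth:                     # walk forward to the matching ")"
            depth += {"(": 1, ")": -1}.get(expr[i], 0)
            i += 1
        print(m.group(1) + ", " + expr[m.end():i - 1])

list_calls("a(b(1 + 2 * 2), c(3,4))")
# a, b(1 + 2 * 2), c(3,4)
# b, 1 + 2 * 2
# c, 3,4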

Related

Can we add/change scope with a plugin in Sublime Text 3?

In a custom *.sublime-syntax file, I have this:
- match: '(\d+)/(\d+)x (.+)$'
  captures:
    1: constant.numeric.owned.items_tracker
    2: constant.numeric.needed.items_tracker
    3: variable.description.items_tracker
I would like to set capture 3's scope to variable.description.done.items_tracker instead if capture 1 is greater than capture 2.
I think it is not possible to do this in sublime-syntax; so can I do this with Python in a plugin, and how?
In Sublime, syntax definitions are the only way to apply scopes, and the scopes are applied by using regular expression matches such as the ones outlined in your question.
The Syntax facility doesn't have any sort of direct "conditional" logic that would allow it to take a programmatic action like converting the matches to integers, comparing them, and doing something different.
Additionally, although a plugin can modify the source of a file at will, it can't apply scopes; that is strictly the purview of the syntax definition itself.
As such, what you want to do is not directly possible in the general case. A potential workaround might be to have as many rules as there are combinations of numeric values, but that's not practical unless the total range of values and the spreads between them are very small.

How to change priority in math order (asterisk)

I want users to input math formulas in my system. How can I convert the case 1 formula to the case 2 formula using Python? In other words, I would like to change the evaluation order specifically for double asterisks (exponentiation).
#case1
3*2**3**2*5
>>>7680
#case2
3*(2**3)**2*5
>>>960
Not only is this not something that Python supports, but really, why would you want to? Modifying BIDMAS or PEMDAS (depending on your location) would not only give you incorrect answers, but also confuse the hell out of any devs looking at the code.
Just use brackets like in Case 2, it's what they're for.
If users are supposed to enter formulas into your program, I would suggest keeping it as is. The reason is that exponentiation in mathematics is right-associative, meaning the execution goes from the top level down. For example: a**b**c = a**(b**c), by convention.
There are some programs that resolve stacked exponentiation left-to-right (bottom-up) -- MS Excel and LibreOffice are some of them -- however, it is against the regular convention, and it always confused the hell out of me.
If you would like to override this behavior and still be mathematically correct, you have to use brackets.
You can always declare your own power method that resolves the way you want it -- something like numpy.power(). You could overload the built-in, but that's too much hassle.
Read this
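To illustrate the "own power method" idea from above, here is a minimal sketch (the helper name power is mine, not a library function): users call it explicitly, so chained exponentiation associates the way it is written, left to right.

def power(base, exponent):
    return base ** exponent

# case 2 written with the helper instead of chained **
print(3 * power(power(2, 3), 2) * 5)   # 960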
Below is an example that achieves this using re:

import re

expression = '3*2**3**2*5'
asterisk_exprs = re.findall(r"\d+\*\*\d+", expression)  # list of all "a**b" sub-expressions
for expr in asterisk_exprs:
    # wrap each matched sub-expression in parentheses
    expression = expression.replace(expr, "({})".format(expr))
# Value of variable 'expression': '3*(2**3)**2*5'
In order to evaluate the mathematical value of str expression, use eval as:
>>> eval(expression)
960

Pretty-print Lisp using Python

Is there a way to pretty-print a Lisp-style code string (in other words, a bunch of balanced parentheses and text within) in Python without reinventing the wheel?
Short answer
I think a reasonable approach, if you can, is to generate Python lists or custom objects instead of strings and use the pprint module, as suggested by @saulspatz.
Long answer
The whole question looks like an instance of an XY problem. Why? Because you are using Python (why not Lisp?) to manipulate strings (why not data structures?) representing generated Lisp-style code, where Lisp-style is defined as "a bunch of parentheses and text within".
To the question "how to pretty-print?", I would thus respond "I wouldn't start from here!".
The best way to not reinvent the wheel in your case, apart from using existing wheels, is to stick to a simple output format.
But first of all, why do you need to pretty-print? Who will look at the resulting code?
Depending on the exact Lisp dialect you are using and the intended usage of the code, you could format your code very differently. Think about newlines, indentation and maximum width of your text, for example. The Common Lisp pretty-printer is particularly evolved and I doubt you want to have the same level of configurability.
If you used Lisp, a simple call to pprint would solve your problem, but you are using Python, so stick with the most reasonable output for the moment because pretty-printing is a can of worms.
If your code is intended for human readers, please:
don't put closing parentheses on their own lines
don't vertically align opening and closing parentheses
don't add spaces after opening parentheses
This is ugly:
( * ( + 3 x )
    (f
        x
        y
    )
)
This is better:
(* (+ 3 x)
   (f x y))
Or simply:
(* (+ 3 x) (f x y))
See here for more details.
But before printing, you have to parse your input string and make sure it is well-formed. Maybe you are sure it is well-formed, due to how you generate your forms, but I'd argue that the printer should ignore that and not make too many assumptions. If you passed the pretty-printer an AST represented by Python objects instead of just strings, this would be easier, as suggested in the comments. You could build a data structure or custom classes and use the pprint (Python) module. That, as said above, seems to be the way to go in your case, if you can change how you generate your Lisp-style code.
With strings, you are supposed to handle any possible input and reject invalid ones.
This means checking that parentheses and quotes are balanced (beware of escape characters), etc.
Actually, you don't really need to build an intermediate tree for printing (though it would probably help for other parts of your program), because Lisp-style code is made of forms that are easily nested and use prefix notation: you can scan your input string from left to right and print as required when seeing parentheses (open parenthesis: recurse; close parenthesis: return from the recursion). When you first encounter an unescaped double-quote ", read until the next one ", and so on.
This, coupled with a simple printing method, could be sufficient for your needs.
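As a very rough sketch of that left-to-right scan (my own illustration, ignoring strings and escape characters for brevity), the following prints each nested form on its own indented line while keeping closing parentheses attached:

import re

def lisp_pprint(src, indent="  "):
    tokens = re.findall(r"[()]|[^\s()]+", src)   # parentheses and atoms
    out, depth, line = [], 0, ""
    for tok in tokens:
        if tok == "(":
            if line:
                out.append(line)                 # flush the previous line
            line = indent * depth + "("
            depth += 1
        elif tok == ")":
            depth -= 1
            line += ")"                          # keep ")" on the same line
        else:
            line += ("" if line.endswith("(") else " ") + tok
    out.append(line)
    return "\n".join(out)

print(lisp_pprint("(* (+ 3 x) (f x y))"))
# (*
#   (+ 3 x)
#   (f x y))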
I think the easiest method would be to use triple-quoted strings. If you say:
print("""
(((This is some lisp code)))
""")
It should work.
You can format your code any way you like within the triple quotes and it will come out the way you want it to.
Best of luck and happy coding!
I made this rudimentary pretty printer once for prettifying CLIPS, which is based on Lisp. Might help:
import os

def clips_pprint(clips_str: str) -> str:
    """Pretty-prints a CLIPS string.

    Indents a CLIPS string for easier visual confirmation during development
    and verification.

    Assumes the CLIPS string is valid CLIPS, i.e. braces are paired.
    """
    LB = "("
    RB = ")"
    TAB = " " * 4
    formatted_clips_str = ""
    tab_count = 0
    for c in clips_str:
        if c == LB:
            # start each nested form on a new line, indented to its depth
            formatted_clips_str += os.linesep
            for _i in range(tab_count):
                formatted_clips_str += TAB
            formatted_clips_str += c
            tab_count += 1
        elif c == RB:
            # closing a form reduces the indentation for what follows
            tab_count -= 1
            formatted_clips_str += c
        else:
            formatted_clips_str += c
    return formatted_clips_str.strip()
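Hypothetical usage of the helper above (the CLIPS snippet is just an example string):

print(clips_pprint("(defrule demo (fact) => (assert (done)))"))
# each "(" starts a new line, indented one level deeper than its enclosing form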

What are `lexpr` and `ApplicationExpression` in nltk?

What exactly does lexpr mean, and what does the following r'\F x.x' mean? Also, what is ApplicationExpression?
from nltk.sem.logic import *
lexpr = Expression.fromstring
zero = lexpr(r'\F x.x')
one = lexpr(r'\F x.F(x)')
two = lexpr(r'\F x.F(F(x))')
three = lexpr(r'\F x.F(F(F(x)))')
four = lexpr(r'\F x.F(F(F(F(x))))')
succ = lexpr(r'\N F x.F(N(F,x))')
plus = lexpr(r'\M N F x.M(F,N(F,x))')
mult = lexpr(r'\M N F.M(N(F))')
pred = lexpr(r'\N F x.(N(\G H.H(G(F)))(\u.x)(\u.u))')
v1 = ApplicationExpression(succ, zero).simplify()
See http://goo.gl/zog68k; nltk.sem.logic.Expression is:
"""This is the base abstract object for all logical expressions"""
There are many types of logical expressions implemented in nltk. See line 1124; the ApplicationExpression is:
This class is used to represent two related types of logical expressions.
The first is a Predicate Expression, such as "P(x,y)". A predicate expression is comprised of a FunctionVariableExpression or
ConstantExpression as the predicate and a list of Expressions as the arguments.
The second is an application of one expression to another, such as
"(\x.dog(x))(fido)".
The reason Predicate Expressions are treated as Application Expressions is
that the Variable Expression predicate of the expression may be replaced
with another Expression, such as a LambdaExpression, which would mean that
the Predicate should be thought of as being applied to the arguments.
The logical expression reader will always curry arguments in an application expression.
So, "\x y.see(x,y)(john,mary)" will be represented internally as
"((\x y.(see(x))(y))(john))(mary)". This simplifies the internals since
there will always be exactly one argument in an application.
The str() method will usually print the curried forms of application
expressions. The one exception is when the application expression is
really a predicate expression (i.e., the underlying function is an
AbstractVariableExpression). This means that the example from above
will be returned as "(\x y.see(x,y)(john))(mary)".
I'm not exactly an expert in formal logic, but your code above is declaring logical lambda expressions with a function variable F and a bound variable x:
>>> from nltk.sem.logic import *
>>> lexpr = Expression.fromstring
>>> zero = lexpr(r'\F x.x')
>>> succ = lexpr(r'\N F x.F(N(F,x))')
>>> v1 = ApplicationExpression(succ, zero).simplify()
>>> v1
<LambdaExpression \F x.F(x)>
>>> print v1
\F x.F(x)
For a crash course, see http://theory.stanford.edu/~arbrad/slides/cs156/lec2-4.pdf and, for an nltk crash course on lambda expressions, see http://www.cs.utsa.edu/~bylander/cs5233/nltk-intro.pdf
You are looking at a small part of quite a complicated toolkit. I'll try to give some background from a bit of research on the web below. Or you can just skip to the "direct answers" section if you like. I'll try to answer your question on the specific part you quote, but I am not an expert on either philosophical logic or natural language processing. The more I read about it, the less I seem to know, but I've included a load of hopefully useful references.
Description of tool / principles / introduction
The code you've posted is a sub-series of the regression tests for the logic module of the Natural Language toolkit for python (NLTK). This toolkit is described in a fairly accessible academic paper here, seemingly written by the authors of the tool. It describes the motivation for the toolkit and writing the logic module - in a nutshell to help automate interpretation of natural language.
The code you've posted defines a number of logical forms (LFs, as they are referred to in the paper I linked). LFs cover statements in first-order predicate logic combined with the lambda operator (i.e. first-order lambda calculus). I will not attempt to completely describe first-order predicate logic here. There's a tutorial on lambda calculus here.
The code comes from a set of regression tests (i.e. demonstrations that the toolbox works correctly on simple, known examples) on the howto page, showing how the toolbox can be used to do simple arithmetic operations. They are an exact encoding of this approach to arithmetic via lambda calculus (Wikipedia link) in the nltk toolkit.
The first five are the numbers zero to four in lambda calculus (Church encoding). The next four are arithmetic operators - succ (successor), plus (addition), mult (multiplication) and pred (predecessor). You have not got the tests that go along with these, so at the moment you simply have a number of LFs, followed by one example of lambda calculus combining two of these LFs (succ and zero) to get v1. As you have applied succ to zero, the result should be one - and that is what they test for on the howto page - i.e. v1 == one should evaluate to True.
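For reference, a quick check along those lines, based only on the snippets in this question (exact printing may vary between nltk versions):

from nltk.sem.logic import Expression, ApplicationExpression

lexpr = Expression.fromstring
zero = lexpr(r'\F x.x')
one = lexpr(r'\F x.F(x)')
succ = lexpr(r'\N F x.F(N(F,x))')

v1 = ApplicationExpression(succ, zero).simplify()
print(v1)          # \F x.F(x)
print(v1 == one)   # should be True: the successor of zero is one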
Direct answer to the Python bits
Let's go through the elements of the code you've posted one by one.
lexpr is the function that generates Logical EXPRessions - it is an alias for Expression.fromstring, via lexpr = Expression.fromstring
It takes a string argument. The r before the string tells Python to interpret it as a raw string literal. For the purposes of this question, that means we don't have to escape the \ symbol
Within the strings, \ is the lambda operator.
F denotes a function and x a bound variable in lambda calculus
The . or dot operator separates the bound variables from the body of the expression / abstraction
So - to take the string you quote in the question:
r'\F x.x'
It is the Church encoding of zero. Church encoding is pretty abstract and hard to get your head round. This tutorial might help - I think I'm starting to get it... Unfortunately, the example you've chosen is zero, and from what I can work out this is a definition rather than something you can derive. It can't be "evaluated to 0" in any meaningful sense. This is the simplest explanation I've found. I'm not in a position to comment on its rigour / correctness.
A Church numeral is a procedure that takes one argument, and that argument is itself another procedure that also takes one argument. The procedure zero represents the integer 0 by returning a procedure that applies its input procedure zero times
Finally, the ApplicationExpression takes one expression and applies it to the other, in this case applying succ (successor) to zero. This is, aptly, called an application in lambda calculus.
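If it helps to see the same encoding outside nltk, here is a plain-Python illustration of Church numerals (my own sketch, not part of the nltk tests):

zero = lambda f: lambda x: x                      # apply f zero times
succ = lambda n: lambda f: lambda x: f(n(f)(x))   # one more application of f

def to_int(church):
    # decode by counting applications of +1 starting from 0
    return church(lambda k: k + 1)(0)

one = succ(zero)
two = succ(one)
print(to_int(zero), to_int(one), to_int(two))     # 0 1 2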
EDIT:
Wrote all that and then found a book hidden on the nltk site - Chapter 10 is particularly applicable to this question, with this section describing lambda calculus.

Safe expression parser in Python

How can I allow users to execute mathematical expressions in a safe way?
Do I need to write a full parser?
Is there something like ast.literal_eval(), but for expressions?
The examples provided with Pyparsing include several expression parsers:
https://github.com/pyparsing/pyparsing/blob/master/examples/fourFn.py is a conventional arithmetic infix notation parser/evaluator implementation using pyparsing. (Despite its name, this actually does 5-function arithmetic, plus several trig functions.)
https://github.com/pyparsing/pyparsing/blob/master/examples/simpleBool.py is a boolean infix notation parser/evaluator, using a pyparsing helper method operatorPrecedence, which simplifies the definition of infix operator notations.
https://github.com/pyparsing/pyparsing/blob/master/examples/simpleArith.py and https://github.com/pyparsing/pyparsing/blob/master/examples/eval_arith.py recast fourFn.py using operatorPrecedence. The first just parses and returns a parse tree; the second adds evaluation logic.
If you want a more pre-packaged solution, look at plusminus, a pyparsing-based extensible arithmetic parsing package.
What sort of expressions do you want? Variable assignment? Function evaluation?
SymPy aims to become a full-fledged Python CAS.
A few weeks ago I did a similar thing, but for logical expressions (or, and, not, comparisons, parentheses, etc.). I did this using the PLY parser: I created a simple lexer and parser, and the parser generated an AST that was later used to perform the calculations. Doing it that way lets you fully control what the user enters, because only expressions that are compatible with the grammar will be parsed.
Yes. Even if there were an equivalent of ast.literal_eval() for expressions, a Python expression can be lots of things other than just a pure mathematical expression, for example an arbitrary function call.
It wouldn't surprise me if there's already a good mathematical expression parser/evaluator available out there in some open-source module, but if not, it's pretty easy to write one of your own.
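As a sketch of that "write one of your own" route (my own code, using only the standard library ast and operator modules; a starting point rather than a hardened sandbox), you can parse the expression and evaluate only a whitelist of node types:

import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.Mod: operator.mod,
    ast.USub: operator.neg, ast.UAdd: operator.pos,
}

def safe_eval(expr):
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("disallowed expression: " + ast.dump(node))
    return _eval(ast.parse(expr, mode="eval"))

print(safe_eval("3*2**3**2*5"))   # 7680
print(safe_eval("(1 + 2) * -4"))  # -12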
Maths expressions will consist of numeric and punctuation characters, possibly 'E' or 'e' if you allow scientific notation for rational numbers, and the only (other) legal use of alphabetic characters will be if you allow/provide specific maths functions (e.g. stddev). So it should be trivial to run along the string looking for alphabetic characters, check that the next little bit isn't suspicious, and then simply eval the string in a try/except block.
Re the comments this reply has received... I agree this approach is playing with fire. Still, that doesn't mean it can't be done safely. I'm new to Python (< 2 months), so I may not know the workarounds to which this is vulnerable (and of course a new Python version could always render the code unsafe in the future), but - for what little it's worth (mainly my own amusement) - here's my crack at it:
def evalMaths(s):
    i = 0
    while i < len(s):
        idn = ''
        # collect a run of alphabetic characters (a would-be identifier)
        while i < len(s) and s[i].isalpha():
            idn += s[i]
            i += 1
        if idn and idn != 'e' and idn != 'abs' and idn != 'round':
            raise Exception("you naughty boy: don't " + repr(idn))
        else:
            i += 1
    return eval(s)
I would be very interested to hear if/how it can be circumvented... (^_^) BTW, I know you could call functions like abs2783 or _983 - if they existed, but they won't. I mean something practical.
In fact, if anyone can do so, I'll create a question with 200 bounty and accept their answer.
