Python parse mathematical text expression

I was wondering if anyone knew of a good Python library for evaluating text-based mathematical expressions. For example,
>>> evaluate("Three plus nine")
12
>>> evaluate("Eight + two")
10
I've seen similar examples that people have done for numeric values and operators in a string. One method used eval to compute the literal value of the expression, and another used regex to parse the text.
If there isn't an existing library that handles this well, I will probably end up using a combination of the regex and eval techniques. I just want to confirm that something like this isn't already out there.

You could try pyparsing, which does general recursive descent parsing. In fact, here is something quite close to your second example.
Regarding your other suggestions:
See here about the security issues of eval (ironically, using it for a calculator).
Fundamentally, regular languages are strictly weaker than the context-free languages recognized by pushdown automata. You shouldn't try to fight a general parsing problem with regexes.
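If you do end up rolling your own, the word-based part doesn't need eval or regexes at all: a pair of lookup tables and a left-to-right pass will do. Here is a minimal sketch; the supported number words and operators are assumptions for illustration, and a library like pyparsing would let you grow this into a real grammar.
NUMBERS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
           "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
OPERATORS = {"plus": lambda a, b: a + b, "+": lambda a, b: a + b,
             "minus": lambda a, b: a - b, "-": lambda a, b: a - b}

def evaluate(text):
    # Split on whitespace and alternate between operands and operators.
    tokens = text.lower().split()
    def operand(tok):
        return NUMBERS[tok] if tok in NUMBERS else int(tok)
    result = operand(tokens[0])
    for op_tok, num_tok in zip(tokens[1::2], tokens[2::2]):
        result = OPERATORS[op_tok](result, operand(num_tok))
    return result

print(evaluate("Three plus nine"))  # 12
print(evaluate("Eight + two"))      # 10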

Related

How to split C++ code while writing a lexical analyzer in Python?

I wrote a lexical analyzer for C++ code in Python, but the problem is that when I use input.split(" ") it won't recognize code like x=2 or function() as three different tokens unless I manually add a space between them, like x = 2.
It also fails to recognize the tokens at the beginning of each line.
(If I add spaces between every two tokens and also at the beginning of each line, my code works correctly.)
I tried splitting the code first by lines and then by spaces, but it got complicated and I still wasn't able to solve the first problem.
I also thought about splitting by operators, but I couldn't actually implement it, and I need the operators to be recognized as tokens as well, so this might not be a good idea.
I would appreciate any solution or suggestion. Thank you.
# first attempt
f = open("code.txt")
input = f.read()
input = input.split(" ")

# second attempt: split into lines first, then split each line on spaces
f = open("code.txt")
input = f.read()
input1 = input.split("\n")
for var in input1:
    var = var.split(" ")
Obviously, trying to handle an expression written both as x=2 and as x = 2 with a single split on spaces isn't going to work.
What you are looking for is a solution that works with both, right?
A basic solution is to use an and operator and combine the conditions you need to parse, as in the line below. Note that this solution isn't scalable and doesn't qualify as good practice, but it can help you work your way toward better (if harder) solutions.
if input.split(' ') and input.split('='):
An intermediate solution would be to use regex.
Regex isn't an easy topic, but you can check the online documentation, and there are good online tools such as Regex 101 for testing your patterns.
The last option would be to convert your input into an AST, which stands for abstract syntax tree. This is the technique employed by C++ compilers such as Clang.
This last one is a genuinely hard topic, so building even a basic lexer this way will probably be quite time consuming, but it may fit your needs.
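For example, a single alternation of token patterns fed to re.findall will already split x=2 or function() into separate tokens without any extra spaces. This is only a rough sketch with a handful of made-up patterns, not a complete C++ tokenizer:
import re

TOKEN_PATTERN = re.compile(r"""
    \d+\.\d+ | \d+                        # numbers
  | [A-Za-z_]\w*                          # identifiers and keywords
  | ::|->|\+\+|--|==|!=|<=|>=|&&|\|\|     # common multi-character operators
  | [-+*/%=<>!&|^~(){}\[\];,.]            # single-character operators and punctuation
""", re.VERBOSE)

def tokenize(source):
    # findall returns every non-overlapping match, left to right,
    # so whitespace and newlines are skipped automatically.
    return TOKEN_PATTERN.findall(source)

print(tokenize("x=2"))          # ['x', '=', '2']
print(tokenize("function()"))   # ['function', '(', ')']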
The usual approach is to scan the incoming text from left to right. At each character position, the lexical analyser selects the longest string which fits some pattern for a "lexeme", which is either a token or ignored input (whitespace and comments, for example). Then the scan continues at the next character.
Lexical patterns are often described using regular expressions, but the standard regular expression module re is not as much help as it could be for this procedure, because it does not have the facility of checking multiple regular expressions in parallel. (And neither does the possible future replacement, the regex module.) Or, more precisely, the library can check multiple expressions in parallel (using alternation syntax, (...|...|...)), but it lacks an interface which can report which of the alternatives was matched. [Note 1]. So it would be necessary to try every possible pattern one at a time and select whichever one turns out to have the longest match.
Note that the matches are always anchored at the current input point; the lexical analyser does not search for a matching pattern. Every input character becomes part of some lexeme, even if that lexeme is ignored, and lexemes do not overlap.
You can write such an analyser by hand for a simple language, but C++ is hardly a simple language. Hand-built lexical analysers most certainly exist, but all the ones I've seen are thousands of lines of not very readable code. So it's usually easier to build an analyser automatically using software designed for that purpose. These have been around for a long time -- Lex was written almost 50 years ago, for example -- and if you are planning on writing more than one lexical analyser, you would be well advised to investigate some of the available tools.
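A rough sketch of that anchored, longest-match loop in Python might look like this; the token names and patterns are illustrative placeholders rather than a real C++ lexer specification:
import re

TOKEN_SPECS = [
    ("NUMBER",     re.compile(r"\d+")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_]\w*")),
    ("OPERATOR",   re.compile(r"==|!=|<=|>=|[-+*/=<>(){};]")),
    ("WHITESPACE", re.compile(r"\s+")),   # matched like any lexeme, then ignored
]

def scan(text):
    pos = 0
    while pos < len(text):
        best_name, best_match = None, None
        for name, pattern in TOKEN_SPECS:
            m = pattern.match(text, pos)   # anchored at the current position
            if m and (best_match is None or m.end() > best_match.end()):
                best_name, best_match = name, m
        if best_match is None:
            raise SyntaxError("unexpected character %r at %d" % (text[pos], pos))
        if best_name != "WHITESPACE":
            yield best_name, best_match.group()
        pos = best_match.end()

print(list(scan("x = y2 <= 10")))
# [('IDENTIFIER', 'x'), ('OPERATOR', '='), ('IDENTIFIER', 'y2'),
#  ('OPERATOR', '<='), ('NUMBER', '10')]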
Notes
The PCRE2 and Oniguruma regex libraries provide a "callout" feature which I believe could be used for this purpose. I haven't actually seen it used in lexical analysis, but it's a fairly recent addition, particularly for Oniguruma, and as far as I can see, the Python bindings for those two libraries do not wrap the callout feature. (Although, as usual with Python bindings to C libraries, documentation is almost non-existent, so I can't say for certain.)

Chunking for Tamil language

I want to use the NLTK chunker for Tamil language (which is an Indic language). However, it says that it doesn't support Unicode because it uses the 'pre' module for regular expressions.
Unresolved Issues
If we use the re module for regular expressions, Python's regular expression engine generates "maximum recursion depth exceeded" errors when processing very large texts, even for regular expressions that should not require any recursion. We therefore use the pre module instead. But note that pre does not include Unicode support, so this module will not work with unicode strings.
Any suggestion for a work around or another way to accomplish it?
Chunkers are language-specific, so you need to train one for Tamil anyway. Of course if you are happy with available off-the-shelf solutions (I've got no idea if there are any, e.g. if the link in the now-deleted answer is any good), you can stop reading here. If not, you can train your own but you'll need a corpus that is annotated with the chunks you want to recognize: perhaps you are after NP chunks (the usual case), but maybe it's something else.
Once you have an annotated corpus, look carefully at chapters 6 and 7 of the NLTK book, and especially section 7.3, Developing and evaluating chunkers. While Chapter 7 begins with the nltk's regexp chunker, keep reading and you'll see how to build a "sequence classifier" that does not rely on the nltk's regexp-based chunking engine. (Chapter 6 is essential for this, so don't skip it.)
It's not a trivial task: You need to understand the classifier approach, put the pieces together, probably convert your corpus to IOB format, and finally select features that will give you satisfactory performance. But it is pretty straightforward, and can be carried out for any language or chunking task for which you have an annotated corpus. The only open-ended part is thinking up contextual cues that you can convert into features to help the classifier decide correctly, and experimenting until you find a good mix. (On the up side, it is a much more powerful approach than pure regexp-based solutions, even for ascii text).
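To make the IOB part concrete, here is a tiny hedged illustration with nltk.chunk.conlltags2tree; the tokens and tags are made-up English stand-ins, since exactly the same mechanics apply to a Tamil corpus with its own tagset and chunk labels:
import nltk

# Each token is a (word, POS tag, IOB chunk tag) triple.
iob_triples = [
    ("the",    "DT",  "B-NP"),
    ("little", "JJ",  "I-NP"),
    ("dog",    "NN",  "I-NP"),
    ("barked", "VBD", "O"),
]

tree = nltk.chunk.conlltags2tree(iob_triples)
print(tree)
# (S (NP the/DT little/JJ dog/NN) barked/VBD)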
You can use LTRC's Shallow Parser for the Tamil language.
You can check out its online demo.

Are regexes an acceptable method for analyzing syntax?

Pardon me if this question is obvious to some, but I'm trying to teach myself how to write an interpreter. I'm doing this in Python, and I already have a lexer programmed.
I've got my list of tokens created, where I'm stuck is constructing the parse tree. I have sort of an idea of where to go from here, but I'm not sure if I am thinking correctly.
This is the syntax I have defined in my grammar for a simple arithmetic expression using a regex.
<a_expression> = <identifier | number> <operator> <identifier | number>
BUT, if my parser receives a stream of tokens matching this pattern from my lexer:
<identifier | number> <operator> <identifier | number> <operator> <identifier | number>
How do I go about parsing this, since it has two operators and three operands instead of just two operands?
Moreover, how do I handle n operands and n-1 operators? I feel like this should be done recursively, but I'm not sure if I need to define more Parsers for different types of expressions or where to go from here. Can I match a pattern of n operands and n-1 operators with a regex?
While today's 'regular' expressions aren't strictly relegated to the land of Regular Languages, you'll find that you need a more powerful tool to do what you're trying to do.
Context-Free Grammars are what you want, and there are a few tools for writing CFGs in Python. Most notable is pyparsing, but there's a port of Haskell's Parsec library called Pysec that you could look into, too.
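As a quick, hedged illustration of the pyparsing route (a generic arithmetic grammar, not necessarily the one you need):
import pyparsing as pp

# Numbers or identifiers as operands, two levels of left-associative operators.
operand = pp.Word(pp.nums) | pp.Word(pp.alphas)
expr = pp.infixNotation(operand, [
    (pp.oneOf("* /"), 2, pp.opAssoc.LEFT),
    (pp.oneOf("+ -"), 2, pp.opAssoc.LEFT),
])

print(expr.parseString("3 + x * 2 - 1").asList())
# roughly [['3', '+', ['x', '*', '2'], '-', '1']]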
Whether a regular expression is apt to parse your syntax depends on whether your syntax (i.e. your grammar) is regular too, or belongs to another Chomsky class.
For type-0 (unrestricted) grammars you will need a Turing machine.
For type-1 (context-sensitive) you will need a linear bounded automaton (or any of the above).
For type-2 (context-free) you will need a pushdown automaton (or any of the above).
And only type-3 (regular) can be read by regular expressions (or any of the above).
You can find further readings e.g. at wikipedia.
Infix arithmetic with precedence is not a regular language. Regular expressions are only good for parsing regular languages. (Modern regex implementations aren't really just regular expressions, and they can in fact parse most context-free languages… but they will take exponential time for some of them, and it's non-trivial to predict which ones.)
But it is a context-free language. See the Wikipedia article on Context-free grammar for a brief explanation. Context-free grammars are good for parsing both regular languages and context-free languages.
However, many languages that are non-regular don't need the full power of CFG.
Two important classes are those that are LL- or LR-parseable (in linear time). (Variants on these, especially LALR and SLR, are also important.) For example, Python can be (and is, at least in the reference implementation) parsed by an LL(1) parser.
Your language fits into an even more restrictive subset of LR(1), OP. In fact, as the name implies ("OP" is short for "Operator Precedence"), it's the paradigm case. And OP parsers are much easier to write by hand than more general parsers. So, if you're going to build a custom parser from scratch, that's what you'd probably want to use here.
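If you do want to hand-roll that operator-precedence idea, here is a rough precedence-climbing sketch; the precedence table, the flat token list, and the tuple-based tree are my own illustrative assumptions, not your grammar:
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse_expression(tokens, pos=0, min_prec=1):
    node = tokens[pos]          # an identifier or number is a leaf
    pos += 1
    while (pos < len(tokens) and tokens[pos] in PRECEDENCE
           and PRECEDENCE[tokens[pos]] >= min_prec):
        op = tokens[pos]
        # Parse the right-hand side with a higher minimum precedence,
        # which is what makes * bind tighter than +.
        rhs, pos = parse_expression(tokens, pos + 1, PRECEDENCE[op] + 1)
        node = (op, node, rhs)
    return node, pos

tree, _ = parse_expression(["3", "+", "x", "*", "2", "-", "1"])
print(tree)   # ('-', ('+', '3', ('*', 'x', '2')), '1')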

Python: regex vs find(), strip()

I am learning Python, and need to format "From" fields received from IMAP. I tried it using str.find() and str.strip(), and also using regex. With find(), etc. my function runs quite a bit faster than with re (I timed it). So, when is it better to use re? Does anybody have any good links/articles related to that? Python documentation obviously doesn't mention that...
find only matches an exact sequence of characters, while a regular expression matches a pattern. Naturally, only looking for an exact sequence is faster (even if your regex pattern is also an exact sequence, there is still some overhead involved).
As a consequence of the above, you should use find if you know the exact sequence, and a regular expression (or something else) when you don't. The exact approach you should use really depends on the complexity of the problem you face.
As a side note, the python re module provides a compile method that allows you to pre-compile a regex if you are going to be using it repeatedly. This can substantially improve speed if you are using the same pattern many times.
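For the "From" header case specifically, a small hedged comparison (the header string and pattern below are made up for illustration) shows both approaches, with the regex precompiled once for reuse:
import re

FROM_RE = re.compile(r"<?([\w.+-]+@[\w.-]+)>?")   # compiled once, reused per message

header = 'From: "Alice Example" <alice@example.com>'

# str methods: fine when the exact delimiters are known
start = header.find("<") + 1
end = header.find(">", start)
print(header[start:end])      # alice@example.com

# regex: more tolerant if the angle brackets are sometimes missing
match = FROM_RE.search(header)
print(match.group(1))         # alice@example.com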
If you intend to do something complex, you should use re. It is more scalable than using string methods.
String methods are good for doing something simple that isn't worth bothering with regular expressions.
So it depends on what you are doing, but usually you should use regular expressions, since they are more powerful.

library for transforming a node tree

I'd like to be able to express a general transformation of one tree into another without writing a bunch of repetitive spaghetti code. Are there any libraries to help with this problem? My target language is Python, but I'll look at other languages as long as it's feasible to port to Python.
Example: I'd like to transform this node tree: (please excuse the S-expressions)
(A (B) (C) (D))
Into this one:
(C (B) (D))
As long as the parent is A and the second ancestor is C, regardless of context (there may be more parents or ancestors). I'd like to express this transformation in a simple, concise, and reusable way. Of course this example is very specific. Please try to address the general case.
Edit: RefactoringNG is the kind of thing I'm looking for, although it introduces an entirely new grammar to solve the problem, which I'd like to avoid. I'm still looking for more and/or better examples.
Background:
I'm able to convert python and cheetah (don't ask!) files into tokenized tree representations, and in turn convert those into lxml trees. I plan to then re-organize the tree and write out the results in order to implement automated refactoring. XSLT seems to be the standard tool to rewrite XML, but the syntax is terrible (in my opinion, obviously) and nobody at our shop would understand it.
I could write some functions which simply use the lxml methods (.xpath and such) to implement my refactorings, but I'm worried that I will wind up with a bunch of purpose-built spaghetti code which can't be re-used.
Let's try this in Python code. I've used strings for the leaves, but this will work with any objects.
def lift_middle_child(in_tree):
    (A, (B,), (C,), (D,)) = in_tree
    return (C, (B,), (D,))

print(lift_middle_child(('A', ('B',), ('C',), ('D',))))  # could use lists too
This sort of tree transformation is generally better performed in a functional style - if you create a bunch of these functions, you can explicitly compose them, or create a composition function to work with them in a point-free style.
Because you've used s-expressions, I assume you're comfortable representing trees as nested lists (or the equivalent - unless I'm mistaken, lxml nodes are iterable in that way). Obviously, this example relies on a known input structure, but your question implies that. You can write more flexible functions, and still compose them, as long as they have this uniform interface.
Here's the code in action: http://ideone.com/02Uv0i
Now, here's a function to reverse children, and using that and the above function, one to lift and reverse:
from functools import reduce  # needed on Python 3, where reduce is no longer a builtin

def compose2(a, b):  # might want to get this from a functional library
    return lambda *x: a(b(*x))

def compose(*funcs):  # compose(a, b, c) = a(b(c(x))) - you might want to reverse that
    return reduce(compose2, funcs)

def reverse_children(in_tree):
    return in_tree[0:1] + in_tree[1:][::-1]  # slightly cryptic, but works for anything subscriptable

lift_and_reverse = compose(reverse_children, lift_middle_child)  # rightmost function applied first - if you find this confusing, reverse the order in compose
print(lift_and_reverse(('A', ('B',), ('C',), ('D',))))
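Building on that, here is a slightly more general (still hedged) sketch that applies a rewrite rule at any depth of a nested-tuple tree, which is closer to the "regardless of context" requirement in the question; the helper and rule names are mine, not from any library:
def rewrite_everywhere(tree, rule):
    # Rewrite bottom-up: transform the children first, then offer the
    # rebuilt node to the rule, which returns a replacement or None.
    if not isinstance(tree, tuple):
        return tree
    rebuilt = (tree[0],) + tuple(rewrite_everywhere(c, rule) for c in tree[1:])
    replacement = rule(rebuilt)
    return rebuilt if replacement is None else replacement

def lift_rule(node):
    # Fires only on the (A (B) (C) (D)) shape from the question.
    if len(node) == 4 and node[0] == 'A':
        A, B, C, D = node
        return (C[0], B, D)
    return None

print(rewrite_everywhere(('root', ('A', ('B',), ('C',), ('D',))), lift_rule))
# ('root', ('C', ('B',), ('D',)))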
What you really want IMHO is a program transformation system, which allows you to parse and transform code using patterns expressed in the surface syntax of the source code (and even the target language) to express the rewrites directly.
You will find that even if you can get your hands on an XML representation of the Python tree, the effort to write an XSLT/XPath transformation is more than you expect; trees representing real code are messier than you'd expect, XSLT isn't that convenient a notation, and it cannot directly express common conditions on trees that you'd like to check (e.g., that two subtrees are the same). A final complication with XML: assume it has been transformed. How do you regenerate the source code syntax from which it came? You need some kind of prettyprinter.
A general problem regardless of how the code is represented is that without information about scopes and types (where you can get it), writing correct transformations is pretty hard. After all, if you are going to transform python into a language that uses different operators for string concat and arithmetic (unlike Java which uses "+" for both), you need to be able to decide which operator to generate. So you need type information to decide. Python is arguably typeless, but in practice most expressions involve variables which have only one type for their entire lifetime. So you'll also need flow analysis to compute types.
Our DMS Software Reengineering Toolkit has all of these capabilities (parsing, flow analysis, pattern matching/rewriting, prettyprinting), and robust parsers for many languages including Python. (While it has flow analysis capability instantiated for C, COBOL, Java, this is not instantiated for Python. But then, you said you wanted to do the transformation regardless of context).
To express your rewrite in DMS, using Python syntax close to your example (which isn't Python?):
domain Python;
rule revise_arguments(f:IDENTIFIER, A:expression, B:expression,
                      C:expression, D:expression): primary -> primary
  = " \f(\A,(\B),(\C),(\D)) "
  -> " \f(\C,(\B),(\D)) ";
The notation above is the DMS rule-rewriting language (RSL). The "..." are metaquotes that separate Python syntax (inside those quotes, DMS knows it is Python because of the domain notation declaration) from the DMS RSL language. The \n inside the meta quote refers to the syntax variable placeholders of the named nonterminal type defined in the rule parameter list. Yes, (...) inside the metaquotes are Python ( ) ... they exist in the syntax trees as far as DMS is concerned, because they, like the rest of the language, are just syntax.
The above rule looks a bit odd because I'm trying to follow your example as closely as possible, and from an expression-language point of view, your example is odd precisely because it does have unusual parentheses.
With this rule, DMS could parse Python (using its Python parser) like
foobar(2+3,(x-y),(p),(baz()))
build an AST, match the (parsed-to-AST) rule against that AST, rewrite it to another AST corresponding to:
foobar(p,(x-y),(baz()))
and then prettyprint the surface syntax (valid) python back out.
If you intended your example to be a transformation on LISP code, you'd need a LISP grammar for DMS (not hard to build, but we don't have much call for this), and write the corresponding surface syntax:
domain Lisp;
rule revise_form(A:form, B:form, C:form, D:form): form -> form
  = " (\A,(\B),(\C),(\D)) "
  -> " (\C,(\B),(\D)) ";
You can get a better feel for this by looking at Algebra as a DMS domain.
If your goal is to implement all this in Python... I don't have much help.
DMS is a pretty big system, and it would be a lot of effort to replicate.
